INN Hotels Project¶
Context¶
A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is a less desirable and possibly revenue-diminishing factor for hotels. Such losses are particularly high for last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
- Loss of resources (revenue) when the hotel cannot resell the room.
- Additional costs of distribution channels by increasing commissions or paying for publicity to help sell these rooms.
- Lowering prices last minute, so the hotel can resell a room, resulting in reducing the profit margin.
- Human resources to make arrangements for the guests.
Objective¶
The increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. They are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are going to be canceled, and help in formulating profitable policies for cancellations and refunds.
Data Description¶
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
- Booking_ID: unique identifier of each booking
- no_of_adults: Number of adults
- no_of_children: Number of Children
- no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
- no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
- type_of_meal_plan: Type of meal plan booked by the customer:
- Not Selected – No meal plan selected
- Meal Plan 1 – Breakfast
- Meal Plan 2 – Half board (breakfast and one other meal)
- Meal Plan 3 – Full board (breakfast, lunch, and dinner)
- required_car_parking_space: Does the customer require a car parking space? (0 - No, 1- Yes)
- room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
- lead_time: Number of days between the date of booking and the arrival date
- arrival_year: Year of arrival date
- arrival_month: Month of arrival date
- arrival_date: Date of the month
- market_segment_type: Market segment designation.
- repeated_guest: Is the customer a repeated guest? (0 - No, 1- Yes)
- no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
- no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
- avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
- no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
- booking_status: Flag indicating if the booking was canceled or not.
Problem Definition¶
Analyze the data of INN Hotels to find which factors have a high influence on booking cancellations, which lead to revenue losses, reduced profit margins, and inefficient resource allocation. Build a predictive model that can predict in advance which bookings are going to be canceled, and help in formulating profitable policies for cancellations and refunds.
Importing necessary libraries and data¶
# Installing the libraries with the specified version.
!pip install pandas==1.5.3 numpy==1.25.2 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 statsmodels==0.14.1 -q --user
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
precision_recall_curve,
roc_curve,
make_scorer,
)
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
Import Dataset¶
# Code to let colab access my google drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Code to read dataset
hotel = pd.read_csv('/content/drive/My Drive/INNHotelsGroup.csv')
# copying data to another variable to avoid any changes to original data
data = hotel.copy()
Data Overview¶
Observations
Sanity checks
Observations check whether the data has been loaded properly. Here, I view the first five and the last five rows of the dataset.
Sanity checks cover the number of rows and columns in the dataset, the data types of the columns (to ensure that data is stored in the preferred format and that values are as expected), the statistical summary of the numerical columns, duplicate values, and missing values.
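The sanity checks listed above can also be gathered into one small helper. This is a minimal sketch, not part of the original notebook: the function name `sanity_report` and the toy frame are assumptions made for illustration, and only standard pandas calls are used.

```python
import pandas as pd


def sanity_report(df: pd.DataFrame) -> dict:
    """Collect basic sanity-check figures for a DataFrame (hypothetical helper)."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_values": int(df.isnull().sum().sum()),  # total missing cells
        "duplicate_rows": int(df.duplicated().sum()),    # rows repeating an earlier row
    }


# Toy frame standing in for the bookings data (values are made up)
toy = pd.DataFrame(
    {
        "lead_time": [224, 5, 1, 5],
        "booking_status": ["Not_Canceled", "Canceled", None, "Canceled"],
    }
)
report = sanity_report(toy)
print(report)
```

On the real data the same call would be `sanity_report(data)`.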
# Code to view the first 5 rows of the dataset
data.head()
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00000 | 0 | Not_Canceled |
| 1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68000 | 1 | Not_Canceled |
| 2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00000 | 0 | Canceled |
| 3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00000 | 0 | Canceled |
| 4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50000 | 0 | Canceled |
# Code to view the last 5 rows of the dataset
data.tail()
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36270 | INN36271 | 3 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 4 | 85 | 2018 | 8 | 3 | Online | 0 | 0 | 0 | 167.80000 | 1 | Not_Canceled |
| 36271 | INN36272 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 228 | 2018 | 10 | 17 | Online | 0 | 0 | 0 | 90.95000 | 2 | Canceled |
| 36272 | INN36273 | 2 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 1 | 148 | 2018 | 7 | 1 | Online | 0 | 0 | 0 | 98.39000 | 2 | Not_Canceled |
| 36273 | INN36274 | 2 | 0 | 0 | 3 | Not Selected | 0 | Room_Type 1 | 63 | 2018 | 4 | 21 | Online | 0 | 0 | 0 | 94.50000 | 0 | Canceled |
| 36274 | INN36275 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 207 | 2018 | 12 | 30 | Offline | 0 | 0 | 0 | 161.67000 | 0 | Not_Canceled |
Observations
The dataset has been uploaded properly, clearly identifying the columns and the rows in the dataset. We can proceed to check the shape of the data to know exactly how many columns and rows are there.
# Code to check the shape of the dataset
data.shape
print('There are', data.shape[0],'rows and', data.shape[1],'columns')
There are 36275 rows and 19 columns
Observations
The dataset has 36275 rows and 19 columns.
We can proceed to check the datatypes of the different columns in the dataset.
# Code to determine datatypes of different columns in the dataset
# data.info() prints its report directly and returns None, so it is not wrapped in print()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Booking_ID                            36275 non-null  object
 1   no_of_adults                          36275 non-null  int64
 2   no_of_children                        36275 non-null  int64
 3   no_of_weekend_nights                  36275 non-null  int64
 4   no_of_week_nights                     36275 non-null  int64
 5   type_of_meal_plan                     36275 non-null  object
 6   required_car_parking_space            36275 non-null  int64
 7   room_type_reserved                    36275 non-null  object
 8   lead_time                             36275 non-null  int64
 9   arrival_year                          36275 non-null  int64
 10  arrival_month                         36275 non-null  int64
 11  arrival_date                          36275 non-null  int64
 12  market_segment_type                   36275 non-null  object
 13  repeated_guest                        36275 non-null  int64
 14  no_of_previous_cancellations          36275 non-null  int64
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64
 18  booking_status                        36275 non-null  object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
Observations
Of the 19 columns in the dataset, we have 1 float, 13 integers, and 5 objects (Booking ID, Type of Meal Plan, Room Type Reserved, Market Segment Type, and Booking Status are all strings). Memory usage is 5.3+ MB.
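The "+" in "5.3+ MB" means pandas is reporting a lower bound, because the contents of object (string) columns are not measured by default. The sketch below shows how the full figure could be obtained with `memory_usage(deep=True)`, and how a repetitive string column shrinks when stored as `category`. The toy frame is an assumption for illustration only.

```python
import pandas as pd

# Toy frame with one repetitive string column (values are made up)
toy = pd.DataFrame(
    {
        "market_segment_type": ["Online", "Offline", "Online"] * 1000,
        "lead_time": [224, 5, 1] * 1000,
    }
)

shallow_bytes = toy.memory_usage(deep=False).sum()  # what the "+" figure is based on
deep_bytes = toy.memory_usage(deep=True).sum()      # also measures string contents

# Storing a low-cardinality string column as category usually saves memory
as_text_bytes = toy["market_segment_type"].memory_usage(deep=True)
as_category_bytes = toy["market_segment_type"].astype("category").memory_usage(deep=True)

print(shallow_bytes, deep_bytes, as_text_bytes, as_category_bytes)
```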
There appear to be no missing values in the entries. We can confirm this using data.isnull().sum(), but first let's check the statistical summary of the dataset.
# Code to check the statistical summary of dataset
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36275.00000 | 1.84496 | 0.51871 | 0.00000 | 2.00000 | 2.00000 | 2.00000 | 4.00000 |
| no_of_children | 36275.00000 | 0.10528 | 0.40265 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 10.00000 |
| no_of_weekend_nights | 36275.00000 | 0.81072 | 0.87064 | 0.00000 | 0.00000 | 1.00000 | 2.00000 | 7.00000 |
| no_of_week_nights | 36275.00000 | 2.20430 | 1.41090 | 0.00000 | 1.00000 | 2.00000 | 3.00000 | 17.00000 |
| required_car_parking_space | 36275.00000 | 0.03099 | 0.17328 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| lead_time | 36275.00000 | 85.23256 | 85.93082 | 0.00000 | 17.00000 | 57.00000 | 126.00000 | 443.00000 |
| arrival_year | 36275.00000 | 2017.82043 | 0.38384 | 2017.00000 | 2018.00000 | 2018.00000 | 2018.00000 | 2018.00000 |
| arrival_month | 36275.00000 | 7.42365 | 3.06989 | 1.00000 | 5.00000 | 8.00000 | 10.00000 | 12.00000 |
| arrival_date | 36275.00000 | 15.59700 | 8.74045 | 1.00000 | 8.00000 | 16.00000 | 23.00000 | 31.00000 |
| repeated_guest | 36275.00000 | 0.02564 | 0.15805 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| no_of_previous_cancellations | 36275.00000 | 0.02335 | 0.36833 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 13.00000 |
| no_of_previous_bookings_not_canceled | 36275.00000 | 0.15341 | 1.75417 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 58.00000 |
| avg_price_per_room | 36275.00000 | 103.42354 | 35.08942 | 0.00000 | 80.30000 | 99.45000 | 120.00000 | 540.00000 |
| no_of_special_requests | 36275.00000 | 0.61966 | 0.78624 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 5.00000 |
Observations
- Minimum average price per room is 0.00 euros. The first quartile is 80.30 euros. The third quartile is 120 euros. The maximum is 540 euros. The median is 99.45 euros.
- Maximum number of previous cancellations is 13.
- Maximum number of previous bookings not canceled is 58.
- Average price per room is 103.42 euros.
- Average number of adults in a room is 1.845, approximately 2. The median is 2. The maximum is 4.
- Minimum lead time is 0, with a maximum of 443. The median is 57 and the mean is 85.23. Since the mean is well above the median, the distribution is likely right-skewed.
- Maximum number of children is 10. The average is less than 1. The first quartile, median, and third quartile are all 0, which indicates that most bookings do not include children.
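The skewness suspected for lead time can be quantified rather than read off the summary table. A minimal sketch, assuming a made-up series in place of `data["lead_time"]`: a mean above the median together with a positive `Series.skew()` points to a right-skewed distribution.

```python
import pandas as pd

# Made-up lead times; on the real data this would be data["lead_time"]
lead_time = pd.Series([0, 3, 17, 40, 57, 90, 126, 300, 443])

mean_value = lead_time.mean()
median_value = lead_time.median()
skewness = lead_time.skew()  # sample skewness; > 0 means a longer right tail

print(f"mean={mean_value:.2f}, median={median_value:.2f}, skew={skewness:.2f}")
```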
# Code to check missing values in the dataset
data.isnull().sum()
| | 0 |
|---|---|
| Booking_ID | 0 |
| no_of_adults | 0 |
| no_of_children | 0 |
| no_of_weekend_nights | 0 |
| no_of_week_nights | 0 |
| type_of_meal_plan | 0 |
| required_car_parking_space | 0 |
| room_type_reserved | 0 |
| lead_time | 0 |
| arrival_year | 0 |
| arrival_month | 0 |
| arrival_date | 0 |
| market_segment_type | 0 |
| repeated_guest | 0 |
| no_of_previous_cancellations | 0 |
| no_of_previous_bookings_not_canceled | 0 |
| avg_price_per_room | 0 |
| no_of_special_requests | 0 |
| booking_status | 0 |
Observations
There are no missing values in the dataset.
# Code to check for duplicates
data.duplicated().sum()
0
Observations
There are no duplicated entries in the dataset.
With no missing values and no duplicated entries, the data is ready for detailed analysis. Before proceeding, let's drop the Booking_ID column, since it is only a unique identifier.
#Code to drop Booking-ID so we can proceed
data.drop('Booking_ID',axis=1,inplace=True)
data.head()
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00000 | 0 | Not_Canceled |
| 1 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68000 | 1 | Not_Canceled |
| 2 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00000 | 0 | Canceled |
| 3 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00000 | 0 | Canceled |
| 4 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50000 | 0 | Canceled |
Observations
Booking ID has been dropped.
Exploratory Data Analysis (EDA)¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Leading Questions:
- What are the busiest months in the hotel?
- Which market segment do most of the guests come from?
- Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
- What percentage of bookings are canceled?
- Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
- Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
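Several of the leading questions are about cancellation rates, overall or within a group. One way to compute them is sketched here with a made-up frame that reuses the column names `repeated_guest` and `booking_status` from the data dictionary. The numbers are illustrative only and say nothing about the real dataset.

```python
import pandas as pd

# Made-up bookings using the dataset's column names
toy = pd.DataFrame(
    {
        "repeated_guest": [0, 0, 0, 1, 1, 0],
        "booking_status": [
            "Canceled", "Not_Canceled", "Canceled",
            "Not_Canceled", "Not_Canceled", "Not_Canceled",
        ],
    }
)

# Overall share of canceled bookings: the mean of a boolean mask is a proportion
overall_cancel_pct = (toy["booking_status"] == "Canceled").mean() * 100

# Row-normalized cross-tabulation: cancellation share within each guest group
by_repeat = pd.crosstab(
    toy["repeated_guest"], toy["booking_status"], normalize="index"
) * 100

print(f"Overall cancellation rate: {overall_cancel_pct:.2f}%")
print(by_repeat)
```

The same `pd.crosstab(..., normalize="index")` pattern would apply to `no_of_special_requests` or `market_segment_type`.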
Univariate Analysis¶
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage labels.
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default False)
    n: number of bars to display (default None, displays all bars)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        order=data[feature].value_counts().index[:n].sort_values(),
        palette="Set2",
    )
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its label
    plt.show()  # show the plot
Observations on No_of_Adults¶
labeled_barplot(data, "no_of_adults", perc=True)
data['no_of_adults'].unique()
array([2, 1, 3, 0, 4])
def get_bookings_by_adults(data, num_adults):
    """
    Gets the number of bookings for a given number of adults.
    Args:
        data: The pandas DataFrame containing booking data.
        num_adults: The number of adults to filter by.
    Returns:
        The number of bookings with the specified number of adults.
    """
    bookings_for_adults = data[data['no_of_adults'] == num_adults]
    num_bookings = bookings_for_adults.shape[0]
    return num_bookings
# Get bookings for different adult counts
adult_counts = [0, 1, 2, 3, 4]
for num_adults in adult_counts:
    num_bookings = get_bookings_by_adults(data, num_adults)
    print(f"Number of bookings for {num_adults} adult(s): {num_bookings}")
Number of bookings for 0 adult(s): 139
Number of bookings for 1 adult(s): 7695
Number of bookings for 2 adult(s): 26108
Number of bookings for 3 adult(s): 2317
Number of bookings for 4 adult(s): 16
Observations
- Bookings for 2 adults have the highest count (26108), followed by bookings for 1 adult (7695) and bookings for 3 adults (2317).
- Most of the hotel stays are either for couples or two adults traveling together.
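The per-value counts above can also be obtained in a single call with `Series.value_counts()`, which avoids listing the possible values by hand. A minimal sketch on a made-up column (on the real data this would be `data["no_of_adults"]`):

```python
import pandas as pd

# Made-up column standing in for data["no_of_adults"]
toy = pd.DataFrame({"no_of_adults": [2, 1, 2, 3, 2, 0, 1]})

# Count each distinct value, then order by the value rather than by frequency
counts = toy["no_of_adults"].value_counts().sort_index()

for num_adults, num_bookings in counts.items():
    print(f"Number of bookings for {num_adults} adult(s): {num_bookings}")
```

The same pattern applies to the children and weekend-night counts.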
Observations on number of children¶
labeled_barplot(data, "no_of_children", perc=True)  # labeled barplot for number of children
# replacing 9 and 10 children with 3
data["no_of_children"] = data["no_of_children"].replace([9, 10], 3)
def get_bookings_by_children(data, num_children):
    """
    Gets the number of bookings for a given number of children.
    Args:
        data: The pandas DataFrame containing booking data.
        num_children: The number of children to filter by.
    Returns:
        The number of bookings with the specified number of children.
    """
    bookings_for_children = data[data['no_of_children'] == num_children]
    num_bookings = bookings_for_children.shape[0]
    return num_bookings
# Get bookings for different children counts
children_counts = [0, 1, 2, 3, 4]
for num_children in children_counts:
    num_bookings = get_bookings_by_children(data, num_children)
    print(f"Number of bookings for {num_children} child(ren): {num_bookings}")
Number of bookings for 0 child(ren): 33577
Number of bookings for 1 child(ren): 1618
Number of bookings for 2 child(ren): 1058
Number of bookings for 3 child(ren): 22
Number of bookings for 4 child(ren): 0
Observations
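The bar plot shows that the large majority of bookings (33577 of 36275) are made without children. After replacing the rare values 9 and 10, the maximum number of children per booking is 3.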
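The replacement of 9 and 10 children with 3 earlier in this section relies on `Series.replace` with a list of values. The sketch below reproduces it on a made-up series and, as an assumed alternative that is not used in the notebook, shows `Series.clip(upper=3)`, which would also cap any other value above 3.

```python
import pandas as pd

# Made-up values standing in for data["no_of_children"]
children = pd.Series([0, 1, 2, 9, 10, 3, 0])

# Same approach as the notebook: map the listed rare values to 3
capped_by_replace = children.replace([9, 10], 3)

# Alternative (assumption, not from the notebook): cap everything above 3
capped_by_clip = children.clip(upper=3)

print(capped_by_replace.tolist())
print(capped_by_clip.tolist())
```

`replace` only touches the listed values, so an unexpected value such as 7 would pass through unchanged, whereas `clip` would cap it as well.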
Observations on number of weekend nights¶
labeled_barplot(data,'no_of_weekend_nights')
def get_bookings_by_weekend_nights(data, num_nights):
    """
    Gets the number of bookings for a given number of weekend nights.
    Args:
        data: The pandas DataFrame containing booking data.
        num_nights: The number of weekend nights to filter by.
    Returns:
        The number of bookings with the specified number of weekend nights.
    """
    bookings_for_nights = data[data['no_of_weekend_nights'] == num_nights]
    num_bookings = bookings_for_nights.shape[0]
    return num_bookings
# Get bookings for different weekend night counts
weekend_night_counts = [0, 1, 2, 3, 4, 5, 6, 7]
for num_nights in weekend_night_counts:
    num_bookings = get_bookings_by_weekend_nights(data, num_nights)
    print(f"Number of bookings for {num_nights} weekend night(s): {num_bookings}")
Number of bookings for 0 weekend night(s): 16872
Number of bookings for 1 weekend night(s): 9995
Number of bookings for 2 weekend night(s): 9071
Number of bookings for 3 weekend night(s): 153
Number of bookings for 4 weekend night(s): 129
Number of bookings for 5 weekend night(s): 34
Number of bookings for 6 weekend night(s): 20
Number of bookings for 7 weekend night(s): 1
Observations
- The largest category, in fact, is bookings without any weekend nights, meaning most stays are weekday stays or do not extend to the weekend.
- Bookings with one or two weekend nights are also common (9995 and 9071), suggesting that many visitors stay for at least part of the weekend, presumably on shorter leisure trips.
- Overall, the data suggests that most bookings are completely within the week or have only one or two nights during the weekend, while longer weekend stays are very rare.
Observations on number of week nights¶
labeled_barplot(data,'no_of_week_nights')
# Filter bookings with more than 5 weeknights
filtered_bookings = data[data['no_of_week_nights'] > 5]
# Count filtered bookings
num_filtered_bookings = filtered_bookings.shape[0]
# Calculate the percentage
percentage = (num_filtered_bookings / data.shape[0]) * 100
# Print the result
print(f"Percentage of bookings extending beyond 5 weeknights: {percentage:.2f}%")
Percentage of bookings extending beyond 5 weeknights: 1.41%
Observations
- The highest counts are for bookings with 1, 2, and 3 weeknights. This tells us that most guests stay for only a few weekdays.
- The more the weeknights, the lower the count of bookings, meaning longer weekday stays are less likely.
- Less than 2% of all bookings extend beyond 5 weeknights; stays over 10 weeknights are extremely rare.
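The share of bookings above a threshold can be written more compactly, because the mean of a boolean mask is the proportion of `True` values. A minimal sketch on made-up data (on the real data the series would be `data["no_of_week_nights"]`):

```python
import pandas as pd

# Made-up values standing in for data["no_of_week_nights"]
week_nights = pd.Series([1, 2, 3, 2, 6, 1, 0, 10, 2, 3])

# Mean of the boolean mask = fraction of bookings with more than 5 weeknights
pct_beyond_5 = (week_nights > 5).mean() * 100

print(f"Percentage of bookings extending beyond 5 weeknights: {pct_beyond_5:.2f}%")
```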
Observations on type of meal plan¶
labeled_barplot(data,'type_of_meal_plan')
Observations
- Meal Plan 1 dominates: most bookings chose Meal Plan 1, which suggests that most guests prefer that plan.
- Meal Plan 2 has fewer bookings.
- A sizeable number of bookings did not select any meal plan; these guests may want flexibility during their stay or arrange meals independently.
Observations on required car parking space¶
labeled_barplot(data,'required_car_parking_space')
total_parking_required = data['required_car_parking_space'].sum()
print(f"Total car parking spaces required: {total_parking_required}")
Total car parking spaces required: 1124
Observations
- Most bookings did not request a car parking space. Either the guests do not need parking or they use means of transportation that do not require it.
- Out of the 36275 bookings, only 1124 (about 3%) required parking.
Observations on room type reserved¶
labeled_barplot(data,'room_type_reserved')
def calculate_room_type_percentages(data):
    """
    Calculates the percentage of reservations for each room type.
    Args:
        data: The pandas DataFrame containing booking data.
    Returns:
        A pandas Series containing the percentage of reservations for each room type.
    """
    room_type_counts = data['room_type_reserved'].value_counts()
    total_bookings = data.shape[0]
    room_type_percentages = (room_type_counts / total_bookings) * 100
    return room_type_percentages
# Calculate and print the percentages
room_type_percentages = calculate_room_type_percentages(data)
print(room_type_percentages)
room_type_reserved
Room_Type 1   77.54652
Room_Type 4   16.69745
Room_Type 6    2.66299
Room_Type 2    1.90765
Room_Type 5    0.73053
Room_Type 7    0.43556
Room_Type 3    0.01930
Name: count, dtype: float64
Observations
- Most of the bookings fall under Room Type 1 (77.5%), which makes it by far the most popular room type among guests.
- Room Type 4 has a fair number of bookings but far fewer than Room Type 1. It accounts for 16.7%.
- Room Types 2, 3, 5, 6, and 7 have comparably low booking counts (together under 6%), so these rooms are likely less popular with guests. This could be one of the factors behind cancellations, although the counts alone do not show it.
- Room Type 1 probably serves most guests' needs or offers better value for money, which would explain why it is the preferred choice.
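The percentage calculation used for room types recurs for other categorical columns, so it can be expressed once with `value_counts(normalize=True)`. This is a sketch under assumptions: the helper name `category_percentages` is hypothetical and the toy frame is made up.

```python
import pandas as pd


def category_percentages(df: pd.DataFrame, column: str) -> pd.Series:
    """Percentage of rows per category in `column` (hypothetical helper)."""
    # normalize=True returns proportions that sum to 1
    return df[column].value_counts(normalize=True) * 100


# Made-up bookings using the dataset's column name
toy = pd.DataFrame(
    {"room_type_reserved": ["Room_Type 1"] * 3 + ["Room_Type 4"]}
)
room_pct = category_percentages(toy, "room_type_reserved")
print(room_pct)
```

On the real data, `category_percentages(data, "arrival_month")` or `category_percentages(data, "market_segment_type")` would follow the same pattern.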
Observations on lead time¶
histogram_boxplot(data, "lead_time") # Code to plot histogram and boxplot
Observations
- From the box plot, it can be observed that most lead times fall in a relatively small range, as indicated by the IQR.
- The outliers to the right show that some bookings have very long lead times, in some cases over 300 days.
- The histplot shows a right-skewed distribution. The highest count is concentrated within the shortest lead times close to 0, which shows that a lot of bookings were made shortly before arrival.
- The pattern indicates that last-minute bookings are common while some guests book in advance.
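The points drawn as outliers in a box plot are, by the usual convention, values beyond 1.5 times the IQR from the quartiles. The sketch below applies that rule to a made-up series; treating such values as outliers is a convention assumed here, not a decision taken in the notebook.

```python
import pandas as pd

# Made-up lead times standing in for data["lead_time"]
lead_time = pd.Series([0, 5, 17, 30, 57, 80, 126, 150, 443])

q1 = lead_time.quantile(0.25)
q3 = lead_time.quantile(0.75)
iqr = q3 - q1

# Upper fence of the 1.5 * IQR rule
upper_fence = q3 + 1.5 * iqr
outliers = lead_time[lead_time > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, upper fence={upper_fence}")
print(outliers.tolist())
```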
Observations on arrival year¶
labeled_barplot(data,'arrival_year')
def calculate_arrival_year_percentages(data):
    """
    Calculates the percentage of bookings for each arrival year.
    Args:
        data: The pandas DataFrame containing booking data.
    Returns:
        A pandas Series containing the percentage of bookings for each arrival year.
    """
    arrival_year_counts = data['arrival_year'].value_counts()
    total_bookings = data.shape[0]
    arrival_year_percentages = (arrival_year_counts / total_bookings) * 100
    return arrival_year_percentages
# Calculate and print the percentages
arrival_year_percentages = calculate_arrival_year_percentages(data)
print(arrival_year_percentages)
arrival_year
2018   82.04273
2017   17.95727
Name: count, dtype: float64
Observations
More guests arrived in 2018 (82%) than in 2017 (18%)
Observations on arrival month¶
labeled_barplot(data,'arrival_month')
def calculate_arrival_month_percentages(data):
    """
    Calculates the percentage of bookings for each arrival month.
    Args:
        data: The pandas DataFrame containing booking data.
    Returns:
        A pandas Series containing the percentage of bookings for each arrival month.
    """
    arrival_month_counts = data['arrival_month'].value_counts()
    total_bookings = data.shape[0]
    arrival_month_percentages = (arrival_month_counts / total_bookings) * 100
    return arrival_month_percentages
# Calculate and print the percentages
arrival_month_percentages = calculate_arrival_month_percentages(data)
print(arrival_month_percentages)
arrival_month
10   14.65748
9    12.71123
8    10.51137
6     8.82977
12    8.32805
11    8.21502
7     8.04962
4     7.54238
5     7.16196
3     6.50034
2     4.69745
1     2.79531
Name: count, dtype: float64
Observations
- October has the highest number of arrivals (14.7% of the total). It could be identified as the peak season for the hotel.
- September (12.7%) and August (10.5%) also have a high number of bookings.
- Arrival counts are fairly stable from March to July (roughly 6.5% to 8.8% per month), showing moderate booking levels and hence consistent but not peak demand. This could be due to the Spring and Summer seasons.
- January (2.8%) and February (4.7%) show the lowest booking counts. This could be due to the Winter season.
- INN Hotels Group could build its marketing and resource allocation strategy around the different seasons of the year to improve cost management and profitability.
Observations on market segment type¶
labeled_barplot(data,'market_segment_type')
def calculate_market_segment_percentages(data):
    """
    Calculates the percentage of bookings for each market segment type.
    Args:
        data: The pandas DataFrame containing booking data.
    Returns:
        A pandas Series containing the percentage of bookings for each market segment type.
    """
    market_segment_counts = data['market_segment_type'].value_counts()
    total_bookings = data.shape[0]
    market_segment_percentages = (market_segment_counts / total_bookings) * 100
    return market_segment_percentages
# Calculate and print the percentages
market_segment_percentages = calculate_market_segment_percentages(data)
print(market_segment_percentages)
market_segment_type
Online          63.99449
Offline         29.02274
Corporate        5.56030
Complementary    1.07788
Aviation         0.34459
Name: count, dtype: float64
Observations
- The market is segmented into Aviation, Complementary, Corporate, Offline, and Online.
- Online accounts for the largest number of guests (64%), followed by offline (29%).
- Corporate bookings make up a smaller segment of the market (6%).
- Aviation and Complementary contribute minimally to the hotel's bookings.
- This affords INN Hotels Group the opportunity to do a Customer Profitability Analysis, allowing it to determine which customer segments are profitable and which ones are not, and to know where sales effort should be concentrated.
Observations on repeated guest¶
labeled_barplot(data,'repeated_guest') # Code to determine repeated guest
repeated_guest_percentage = (data['repeated_guest'].sum() / len(data)) * 100
print(f"Percentage of repeated guests: {repeated_guest_percentage:.2f}%")
Percentage of repeated guests: 2.56%
Observations
- 2.56% of the customers are repeated guests. 97.44% did not repeat. This is a cause for concern and must be investigated. Could it be that they did not have a great customer experience the first time?
- There is also the possibility that 97.44% are first time customers.
Observations on number of previous cancellations¶
labeled_barplot(data,'no_of_previous_cancellations') # Code to visualize number of previous cancellations
# Percentage of bookings with previous cancellations
bookings_with_previous_cancellations = data[data['no_of_previous_cancellations'] > 0]
percentage_with_previous_cancellations = (len(bookings_with_previous_cancellations) / len(data)) * 100
print(f"Percentage of bookings with previous cancellations: {percentage_with_previous_cancellations:.2f}%")
# Percentage of bookings with NO previous cancellations
bookings_with_no_previous_cancellations = data[data['no_of_previous_cancellations'] == 0]
percentage_with_no_previous_cancellations = (len(bookings_with_no_previous_cancellations) / len(data)) * 100
print(f"Percentage of bookings with NO previous cancellations: {percentage_with_no_previous_cancellations:.2f}%")
Percentage of bookings with previous cancellations: 0.93%
Percentage of bookings with NO previous cancellations: 99.07%
Observations
- A large number of bookings are made by guests with zero previous cancellations (99.07%).
- Very few bookings (0.93%) come from customers with at least one previous cancellation. INN Hotels Group should continue with whatever effective cancellation policy they have in place to discourage frequent cancellation.
Observations on number of previous booking not canceled¶
histogram_boxplot(data,'no_of_previous_bookings_not_canceled') # Code to create histogram and boxplot
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()  # number of category levels
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        order=data[feature].value_counts().index[:n],
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count or percentage
    plt.show()  # show the plot
Observations
- Most of the values are around zero; only a few guests had a large number of prior non-canceled bookings
Observations on average price per room¶
histogram_boxplot(data,'avg_price_per_room') ## Code to create histogram_boxplot for average price per room
data[data["avg_price_per_room"] == 0]
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 2 | 2017 | 9 | 10 | Complementary | 0 | 0 | 0 | 0.00000 | 1 | Not_Canceled |
| 145 | 1 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 13 | 2018 | 6 | 1 | Complementary | 1 | 3 | 5 | 0.00000 | 1 | Not_Canceled |
| 209 | 1 | 0 | 0 | 0 | Meal Plan 1 | 0 | Room_Type 1 | 4 | 2018 | 2 | 27 | Complementary | 0 | 0 | 0 | 0.00000 | 1 | Not_Canceled |
| 266 | 1 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2017 | 8 | 12 | Complementary | 1 | 0 | 1 | 0.00000 | 1 | Not_Canceled |
| 267 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 4 | 2017 | 8 | 23 | Complementary | 0 | 0 | 0 | 0.00000 | 1 | Not_Canceled |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35983 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 7 | 0 | 2018 | 6 | 7 | Complementary | 1 | 4 | 17 | 0.00000 | 1 | Not_Canceled |
| 36080 | 1 | 0 | 1 | 1 | Meal Plan 1 | 0 | Room_Type 7 | 0 | 2018 | 3 | 21 | Complementary | 1 | 3 | 15 | 0.00000 | 1 | Not_Canceled |
| 36114 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 3 | 2 | Online | 0 | 0 | 0 | 0.00000 | 0 | Not_Canceled |
| 36217 | 2 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 2 | 3 | 2017 | 8 | 9 | Online | 0 | 0 | 0 | 0.00000 | 2 | Not_Canceled |
| 36250 | 1 | 0 | 0 | 2 | Meal Plan 2 | 0 | Room_Type 1 | 6 | 2017 | 12 | 10 | Online | 0 | 0 | 0 | 0.00000 | 0 | Not_Canceled |
545 rows × 18 columns
data.loc[data["avg_price_per_room"] == 0, "market_segment_type"].value_counts()
| market_segment_type | count |
|---|---|
| Complementary | 354 |
| Online | 191 |
# Calculating the 25th quantile
Q1 = data["avg_price_per_room"].quantile(0.25) ## Code to calculate 25th quantile for average price per room
# Calculating the 75th quantile
Q3 = data["avg_price_per_room"].quantile(0.75) ## Code to calculate 75th quantile for average price per room
# Calculating IQR
IQR = Q3 - Q1
# Calculating value of upper whisker
Upper_Whisker = Q3 + 1.5 * IQR
Upper_Whisker
179.55
# capping extreme outliers (prices of 500 euros or more) at the value of the upper whisker
data.loc[data["avg_price_per_room"] >= 500, "avg_price_per_room"] = Upper_Whisker
avg_price = data['avg_price_per_room'].mean()
print(f"The average price of bookings is: {avg_price:.2f} euros")
The average price of bookings is: 103.41 euros
Observations
- The average price per room is 103.41 euros.
- The boxplot shows a fairly wide interquartile range, which indicates some variability in room prices. The outliers at the higher end suggest that most prices fall within a limited range but that a few bookings have substantially higher average room prices, perhaps because some rooms are premium or simply because of high demand.
- The median shows that most bookings are at moderate prices.
- The histogram is right-skewed, with most room prices in the lower-middle range of about 50 to 150 euros.
- There are very few bookings at price points above 300 euros.
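The whisker calculation and capping step above can be sketched on a small example. The prices below are hypothetical and chosen only so the arithmetic is easy to follow; the 500 euro cut-off mirrors the one used in the notebook cell.

```python
import pandas as pd

# Hypothetical prices in euros (not taken from the dataset)
prices = pd.Series([60.0, 80.0, 90.0, 100.0, 110.0, 120.0, 540.0])

q1 = prices.quantile(0.25)          # 25th percentile
q3 = prices.quantile(0.75)          # 75th percentile
iqr = q3 - q1                       # interquartile range
upper_whisker = q3 + 1.5 * iqr      # upper boxplot whisker

# Cap only the extreme values (500 or more) at the upper whisker
capped = prices.where(prices < 500, upper_whisker)

print(upper_whisker)
print(capped.tolist())
```

Values below the cut-off are left untouched, so the cap changes only the single extreme price in this example.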
Observations on number of special requests¶
labeled_barplot(data,'no_of_special_requests') ## Code to create labeled_barplot for number of special requests
Observations
- The majority of guests did not make special requests. They did not need additional accommodation or services.
- There were 11373 bookings with one special request, followed by 4364 bookings with 2 special requests, 675 bookings with 3 special requests, 78 bookings with 4 special requests, and 8 bookings with 5 special requests.
- The count of bookings declines as the number of special requests increases.
Observations on booking status¶
labeled_barplot(data,'booking_status') ## Code to create labeled_barplot for booking status
Let's encode Canceled bookings to 1 and Not_Canceled as 0 for further analysis
data["booking_status"] = data["booking_status"].apply(
lambda x: 1 if x == "Canceled" else 0
)
Observations
- The greater part of the bookings (24,390) were not canceled, an indication that most guests follow through with their bookings.
- 11,885 bookings were canceled, which is a high share (about a third of all bookings).
- INN Hotels Group should implement flexible policies that will dissuade guests from canceling.
Bivariate Analysis¶
Correlation¶
cols_list = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(12, 7))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
Observations
- The heatmap shows a few notable correlations: repeat guests tend to have more previous non-canceled bookings and cancel less, whereas first-time guests cancel more. Most correlations are low, however, which suggests that many of these features are relatively independent and that booking patterns vary widely across guests.
- Some relationship between no_of_previous_bookings_not_canceled and repeated_guest (0.54) can be established.
- no_of_previous_bookings_not_canceled and no_of_previous_cancellations also show a moderate relationship (0.47).
- A slight relationship can be established between avg_price_per_room and no_of_children (0.35).
- The closer the correlation coefficient is to +1 or -1, the stronger the linear relationship. +1 indicates a perfect positive relationship, -1 indicates a perfect negative relationship, and 0 indicates no linear relationship at all.
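As a small illustration of how to read the coefficient, the sketch below computes the Pearson correlation for three made-up pairs of variables with NumPy. The arrays are hypothetical and only show the three reference cases (+1, -1 and 0).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y rises exactly in step with x: perfect positive relationship
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# y falls exactly in step with x: perfect negative relationship
r_neg = np.corrcoef(x, -3 * x + 10)[0, 1]

# y is symmetric around the middle of x: no linear relationship
r_zero = np.corrcoef(x, np.array([2.0, 1.0, 0.0, 1.0, 2.0]))[0, 1]

print(r_pos, r_neg, r_zero)
```

The last case is a reminder that a coefficient near 0 rules out only a linear relationship; the two variables there are clearly related, just not along a straight line.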
Creating functions that will help us with further analysis.
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0])
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
    )
    plt.tight_layout()
    plt.show()
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]  # least frequent target class
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    # a second plt.legend call replaces the first, so only the final placement is kept
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
Analysis to see how booking status varies with no. of adults, no. of children, no. of weekend nights, no. of week nights, type of meal plan, required car parking space, room type reserved, arrival year, arrival month, market segment type, repeated guest, no. of previous cancellations, no. of previous bookings not canceled, and no. of special requests.¶
Booking Status vs Number of Adults¶
stacked_barplot(
data,"no_of_adults", "booking_status"
) # creates stacked barplot of booking status with respect to number of adults
booking_status      0      1    All
no_of_adults
All             24390  11885  36275
2               16989   9119  26108
1                5839   1856   7695
3                1454    863   2317
0                  95     44    139
4                  13      3     16
------------------------------------------------------------------------------------------------------------------------
Observations
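- Not canceled (0) outnumbers canceled (1) for every number of adults.
- Number of adults has only a minor impact on cancellation rates: bookings with one adult are canceled about 24% of the time, compared with about 35% for two adults and 37% for three. Booking status depends only weakly on the number of adults.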
Booking Status vs Number of Children¶
stacked_barplot(
data,"no_of_children", "booking_status"
) # creates stacked barplot of booking status with respect to number of children
booking_status      0      1    All
no_of_children
All             24390  11885  36275
0               22695  10882  33577
1                1078    540   1618
2                 601    457   1058
3                  16      6     22
------------------------------------------------------------------------------------------------------------------------
Observations
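- There is little relationship between number of children and booking status: the cancellation rate is about 32% with no children and 33% with one child. Bookings with two children are canceled more often (about 43%), but they are a small share of the data. It is clear from the univariate analysis that most bookings are made by adults unaccompanied by children.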
Booking Status vs Number of Weekend Nights¶
stacked_barplot(
data,"no_of_weekend_nights", "booking_status"
) # Code to create stacked barplot of booking status with respect to number of weekend nights
booking_status            0      1    All
no_of_weekend_nights
All                   24390  11885  36275
0                     11779   5093  16872
1                      6563   3432   9995
2                      5914   3157   9071
4                        46     83    129
3                        79     74    153
5                         5     29     34
6                         4     16     20
7                         0      1      1
------------------------------------------------------------------------------------------------------------------------
Observations
- There is a relationship between booking status and number of weekend nights.
- Bookings with more weekend nights (e.g. 4, 5, 6, 7) have a higher rate of cancellation; they are canceled more often than not.
- Bookings with fewer weekend nights (0 to 2) have lower rates of cancellation, roughly 30% to 35%.
- The longer the stay over weekends, the less certain it is that the booking will be honored, although very few bookings include more than 2 weekend nights.
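The cancellation rates behind these observations can be recomputed directly from the printed counts. The sketch below copies the counts from the output above into a small frame; `counts` and `cancel_rate_pct` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Counts copied from the printed crosstab (rows: 0 to 7 weekend nights)
counts = pd.DataFrame(
    {
        "not_canceled": [11779, 6563, 5914, 79, 46, 5, 4, 0],
        "canceled": [5093, 3432, 3157, 74, 83, 29, 16, 1],
    },
    index=pd.Index(range(8), name="no_of_weekend_nights"),
)

# Share of bookings canceled within each row, in percent
counts["total"] = counts["not_canceled"] + counts["canceled"]
counts["cancel_rate_pct"] = 100 * counts["canceled"] / counts["total"]

print(counts.round(1))
```

On the full dataset the same rates would come from `pd.crosstab(..., normalize="index")`, which is what `stacked_barplot` plots.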
Booking Status vs Number of Week Nights¶
stacked_barplot(
data,"no_of_week_nights", "booking_status"
) # Code to create stacked barplot of booking status with respect to number of week nights
booking_status         0      1    All
no_of_week_nights
All                24390  11885  36275
2                   7447   3997  11444
3                   5265   2574   7839
1                   6916   2572   9488
4                   1847   1143   2990
0                   1708    679   2387
5                    982    632   1614
6                    101     88    189
10                     9     53     62
7                     61     52    113
8                     30     32     62
9                     13     21     34
11                     3     14     17
15                     2      8     10
12                     2      7      9
13                     0      5      5
14                     3      4      7
16                     0      2      2
17                     1      2      3
------------------------------------------------------------------------------------------------------------------------
Observations
- There is a relationship between booking status and number of week nights.
- Bookings with many week nights (10 or more) have the highest rates of cancellation and are canceled more often than not, although there are few such bookings.
- Bookings with fewer week nights have lower rates of cancellation.
- The longer the stay over week nights, the less certain it is that the booking will be honored. Bookings with few week nights are less likely to be canceled.
Booking Status vs type of meal plan¶
stacked_barplot(
data,"type_of_meal_plan", "booking_status"
) # Code to create stacked barplot of booking status with respect to type of meal plan
booking_status         0      1    All
type_of_meal_plan
All                24390  11885  36275
Meal Plan 1        19156   8679  27835
Not Selected        3431   1699   5130
Meal Plan 2         1799   1506   3305
Meal Plan 3            4      1      5
------------------------------------------------------------------------------------------------------------------------
Observations
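- Some relationship can be observed between the type of meal plan chosen and the likelihood of cancellation.
- Meal Plan 2 has the highest cancellation rate (about 46%), followed by Not Selected (about 33%) and Meal Plan 1 (about 31%). Meal Plan 1 has the largest number of cancellations only because it accounts for most bookings.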
Booking Status vs Required car parking space¶
stacked_barplot(
data,"required_car_parking_space", "booking_status"
) # Code to create stacked barplot of booking status with respect to required car parking space
booking_status                  0      1    All
required_car_parking_space
All                         24390  11885  36275
0                           23380  11771  35151
1                            1010    114   1124
------------------------------------------------------------------------------------------------------------------------
Observations
- Some relationship can be established between booking status and required car parking space.
- Guests who request a car parking space are less likely to cancel (about 10% of their bookings are canceled, against about 33% for guests who do not).
Booking Status vs Room Type Reserved¶
stacked_barplot(
data,"room_type_reserved", "booking_status"
) # Code to create stacked barplot of booking status with respect to room type reserved
booking_status          0      1    All
room_type_reserved
All                 24390  11885  36275
Room_Type 1         19058   9072  28130
Room_Type 4          3988   2069   6057
Room_Type 6           560    406    966
Room_Type 2           464    228    692
Room_Type 5           193     72    265
Room_Type 7           122     36    158
Room_Type 3             5      2      7
------------------------------------------------------------------------------------------------------------------------
Observation
- The relationship between room type reserved and the possibility of cancellation is not strong.
Booking Status vs Arrival year¶
stacked_barplot(
data,"arrival_year", "booking_status"
) # Code to create stacked barplot of booking status with respect to arrival year
booking_status      0      1    All
arrival_year
All             24390  11885  36275
2018            18837  10924  29761
2017             5553    961   6514
------------------------------------------------------------------------------------------------------------------------
Observations
- There were more cancellations in 2018 than in 2017, both in number and as a share of bookings (about 37% against about 15%). That indicates some relationship between booking status and arrival year. The higher cancellations in 2018 could be due to several factors, such as changes in economic conditions or changes in taste and behavior.
Booking Status vs Arrival month¶
stacked_barplot(
data,"arrival_month", "booking_status"
) # Code to create stacked barplot of booking status with respect to arrival month
booking_status      0      1    All
arrival_month
All             24390  11885  36275
10               3437   1880   5317
9                3073   1538   4611
8                2325   1488   3813
7                1606   1314   2920
6                1912   1291   3203
4                1741    995   2736
5                1650    948   2598
11               2105    875   2980
3                1658    700   2358
2                1274    430   1704
12               2619    402   3021
1                 990     24   1014
------------------------------------------------------------------------------------------------------------------------
Observations
- We can establish a relationship between arrival month and the likelihood of cancellation.
- Cancellation rates are higher in the summer (e.g. June, July, and August) and lower in the winter (e.g. December, January, and February). In between, cancellations are fairly stable.
Booking Status vs Market Segment Type¶
stacked_barplot(
data,"market_segment_type", "booking_status"
) # Code to create stacked barplot of booking status with respect to market segment type
booking_status           0      1    All
market_segment_type
All                  24390  11885  36275
Online               14739   8475  23214
Offline               7375   3153  10528
Corporate             1797    220   2017
Aviation                88     37    125
Complementary          391      0    391
------------------------------------------------------------------------------------------------------------------------
Observation
- The online segment has the highest cancellation rate (about 37%), followed by offline (about 30%). It is interesting to note that these two segments are where INN Hotels Group gets most of its business.
- Most future cancellations are therefore likely to come from online and offline bookings. Aviation cancels at a rate similar to offline but accounts for very few bookings, and complementary bookings were never canceled.
- Some relationship can be established between booking status and market segment type.
Booking Status vs Repeated Guest¶
stacked_barplot(
data,"repeated_guest", "booking_status"
) # Code to create stacked barplot of booking status with respect to repeated guest
booking_status      0      1    All
repeated_guest
All             24390  11885  36275
0               23476  11869  35345
1                 914     16    930
------------------------------------------------------------------------------------------------------------------------
Observations
- Repeated guests have much lower cancellation rates than non-repeated guests (16 of 930 bookings, against roughly a third for other guests).
- There is a strong association between being a repeated guest and not canceling. Customer loyalty programs should be introduced to build on this.
Booking Status vs Number of Previous Cancellations¶
stacked_barplot(
data,"no_of_previous_cancellations", "booking_status"
) # Code to create stacked barplot of booking status with respect to no. of previous cancellations
booking_status                    0      1    All
no_of_previous_cancellations
All                           24390  11885  36275
0                             24068  11869  35937
1                               187     11    198
13                                0      4      4
3                                42      1     43
2                                46      0     46
4                                10      0     10
5                                11      0     11
6                                 1      0      1
11                               25      0     25
------------------------------------------------------------------------------------------------------------------------
Observations
- We can establish some relationship between the number of previous cancellations and the current booking, but it rests on very few bookings. All 4 bookings with 13 previous cancellations were canceled, while bookings with 1 to 11 previous cancellations were rarely canceled (for example, 11 of 198 with one previous cancellation). Almost all cancellations (11,869 of 11,885) come from guests with no previous cancellations.
- This gives the hotel a starting point for identifying guests at higher risk of cancellation based on their booking history.
Booking Status vs Number of Previous bookings not cancelled¶
stacked_barplot(
data,"no_of_previous_bookings_not_canceled", "booking_status"
) # Code to create stacked barplot of booking status with respect to no. of previous bookings not cancelled
booking_status                            0      1    All
no_of_previous_bookings_not_canceled
All                                   24390  11885  36275
0                                     23585  11878  35463
1                                       224      4    228
12                                       11      1     12
4                                        64      1     65
6                                        35      1     36
2                                       112      0    112
44                                        2      0      2
43                                        1      0      1
42                                        1      0      1
41                                        1      0      1
40                                        1      0      1
38                                        1      0      1
39                                        1      0      1
46                                        1      0      1
37                                        1      0      1
36                                        1      0      1
35                                        1      0      1
45                                        1      0      1
48                                        2      0      2
47                                        1      0      1
33                                        1      0      1
49                                        1      0      1
50                                        1      0      1
51                                        1      0      1
52                                        1      0      1
53                                        1      0      1
54                                        1      0      1
55                                        1      0      1
56                                        1      0      1
57                                        1      0      1
58                                        1      0      1
34                                        1      0      1
31                                        2      0      2
32                                        2      0      2
3                                        80      0     80
5                                        60      0     60
7                                        24      0     24
8                                        23      0     23
9                                        19      0     19
10                                       19      0     19
11                                       15      0     15
13                                        7      0      7
14                                        9      0      9
15                                        8      0      8
16                                        7      0      7
17                                        6      0      6
18                                        6      0      6
19                                        6      0      6
20                                        6      0      6
21                                        6      0      6
22                                        6      0      6
23                                        3      0      3
24                                        3      0      3
25                                        3      0      3
26                                        2      0      2
27                                        3      0      3
28                                        2      0      2
29                                        2      0      2
30                                        2      0      2
------------------------------------------------------------------------------------------------------------------------
data[data["no_of_previous_bookings_not_canceled"] != 0][
"booking_status"
].value_counts()
| booking_status | count |
|---|---|
| 0 | 805 |
| 1 | 7 |
# Filter for non-canceled bookings
non_canceled_bookings = data[data['booking_status'] == 0]
# Count non-canceled bookings
num_non_canceled = non_canceled_bookings.shape[0]
# Calculate percentage
percentage_non_canceled = (num_non_canceled / data.shape[0]) * 100
# Print the result
print(f"Percentage of bookings not canceled: {percentage_non_canceled:.2f}%")
Percentage of bookings not canceled: 67.24%
Observations
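- A strong relationship between the number of previous non-canceled bookings and the likelihood of cancellation can be established: only 7 of the 812 bookings made by guests with at least one previous non-canceled booking were canceled, compared with about a third of bookings from guests with none.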
Booking Status vs No. of Special requests¶
stacked_barplot(
data,"no_of_special_requests", "booking_status"
) # Code to create stacked barplot of booking status with respect to no. of special requests
booking_status              0      1    All
no_of_special_requests
All                     24390  11885  36275
0                       11232   8545  19777
1                        8670   2703  11373
2                        3727    637   4364
3                         675      0    675
4                          78      0     78
5                           8      0      8
------------------------------------------------------------------------------------------------------------------------
Observations
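- We can establish a relationship between the booking status and the number of special requests.
- As the number of special requests increases, the cancellation rate decreases, from about 43% with no requests to about 24% with one and 15% with two. No booking with three or more special requests was canceled.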
Question 1: What are the busiest months in the hotel?¶
# grouping the data on arrival months and extracting the count of bookings
monthly_data = data.groupby(["arrival_month"])["booking_status"].count()
# creating a dataframe with months and count of customers in each month
monthly_data = pd.DataFrame(
{"Month": list(monthly_data.index), "Guests": list(monthly_data.values)}
)
# plotting the trend over different months
plt.figure(figsize=(10, 5))
sns.lineplot(data=monthly_data, x="Month", y="Guests")
plt.show()
def calculate_arrival_month_percentages(data):
    """
    Calculates the percentage of bookings for each arrival month.
    Args:
        data: The pandas DataFrame containing booking data.
    Returns:
        A pandas Series containing the percentage of bookings for each arrival month.
    """
    arrival_month_counts = data['arrival_month'].value_counts()
    total_bookings = data.shape[0]
    arrival_month_percentages = (arrival_month_counts / total_bookings) * 100
    return arrival_month_percentages
# Calculate and print the percentages
arrival_month_percentages = calculate_arrival_month_percentages(data)
print(arrival_month_percentages)
arrival_month
10   14.65748
9    12.71123
8    10.51137
6     8.82977
12    8.32805
11    8.21502
7     8.04962
4     7.54238
5     7.16196
3     6.50034
2     4.69745
1     2.79531
Name: count, dtype: float64
Observations
- October, September, and August are the busiest months in the hotel, October being the highest.
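The percentage helper above divides the value counts by the number of rows. As a sketch of a shorter route to the same numbers, pandas can normalize the counts directly with `value_counts(normalize=True)`. The `months` series below is made up for illustration and is not the booking data.

```python
import pandas as pd

# Hypothetical arrival months for eight bookings
months = pd.Series([10, 10, 10, 9, 9, 8, 1, 12], name="arrival_month")

# Share of bookings per month, in percent (sorted from most to least frequent)
month_pct = months.value_counts(normalize=True) * 100

# The busiest month is the one with the largest share
busiest_month = month_pct.idxmax()

print(month_pct)
print(busiest_month)
```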
Question 2: Which market segment do most of the guests come from?¶
labeled_barplot(data,'market_segment_type')
def calculate_market_segment_percentages(data):
    """
    Calculates the percentage of bookings for each market segment type.
    Args:
        data: The pandas DataFrame containing booking data.
    Returns:
        A pandas Series containing the percentage of bookings for each market segment type.
    """
    market_segment_counts = data['market_segment_type'].value_counts()
    total_bookings = data.shape[0]
    market_segment_percentages = (market_segment_counts / total_bookings) * 100
    return market_segment_percentages
# Calculate and print the percentages
market_segment_percentages = calculate_market_segment_percentages(data)
print(market_segment_percentages)
market_segment_type
Online          63.99449
Offline         29.02274
Corporate        5.56030
Complementary    1.07788
Aviation         0.34459
Name: count, dtype: float64
Observations
- Most guests come from the online segment, representing 64%
- 29% come from offline
Question 3: Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?¶
plt.figure(figsize=(10, 6))
sns.boxplot(
data=data, x="market_segment_type", y="avg_price_per_room"
)
plt.show()
# plot of data frame made of just the market_segment_type and avg_price_per_room
total = len(data["market_segment_type"])  # length of the column
count = data["market_segment_type"].nunique()  # number of unique market segments
plt.figure(figsize=(12, 7))  # sets figure size
plt.xticks(fontsize=15)  # sets tick label font size
# create barplot; each bar is the mean avg_price_per_room of a market segment
ax = sns.barplot(
    data=data[["market_segment_type", "avg_price_per_room"]],
    x="market_segment_type",
    y="avg_price_per_room",
    palette="deep",  # sets color for plot
    ci=None,  # no confidence-interval bars
)
# label each bar with its height, i.e. the mean price in euros
for p in ax.patches:
    label = "€{:.2f}".format(p.get_height())  # mean price for the segment
    x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
    y = p.get_height()  # top of the bar
    # place the label just above the bar
    ax.annotate(
        label,
        (x, y),
        ha="center",
        va="center",
        size=15,
        xytext=(0, 5),
        textcoords="offset points",
    )  # annotate the mean price
plt.savefig(
    "avg_room_price_per_market_segment.jpg", bbox_inches="tight"
)  # saves plot as JPEG
plt.show()  # show the plot
Observations
The average room prices in the different market segments are:
- Online €112.26
- Offline €91.60
- Corporate €82.91
- Aviation €100.70
- Complementary €3.14
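The bar heights in the plot are per-segment means, so the same figures can be obtained as a table with a groupby. This is a minimal sketch on a made-up frame; the column names match the dataset, but the rows and resulting values are hypothetical.

```python
import pandas as pd

# Hypothetical bookings (not taken from the dataset)
bookings = pd.DataFrame(
    {
        "market_segment_type": ["Online", "Online", "Offline", "Offline", "Complementary"],
        "avg_price_per_room": [120.0, 100.0, 95.0, 85.0, 0.0],
    }
)

# Mean room price per market segment, highest first
mean_price = (
    bookings.groupby("market_segment_type")["avg_price_per_room"]
    .mean()
    .sort_values(ascending=False)
)

print(mean_price)
```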
Question 4: What percentage of bookings are cancelled?¶
booking_status_counts = data['booking_status'].value_counts()
canceled_percentage = (booking_status_counts[1] / len(data)) * 100
print(f"Percentage of bookings canceled: {canceled_percentage:.2f}%")
Percentage of bookings canceled: 32.76%
Observations
- 32.76% of bookings were canceled.
Question 5: Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?¶
data.groupby("booking_status")["repeated_guest"].value_counts()
| booking_status | repeated_guest | count |
|---|---|---|
| 0 | 0 | 23476 |
| 0 | 1 | 914 |
| 1 | 0 | 11869 |
| 1 | 1 | 16 |
# Filter for repeating guests
repeating_guests = data[data['repeated_guest'] == 1]
# Calculate cancellations within this group
repeating_guest_cancellations = repeating_guests[repeating_guests['booking_status'] == 1].shape[0]
# Calculate the percentage
percentage_repeating_guest_cancellations = (repeating_guest_cancellations / repeating_guests.shape[0]) * 100
# Print the result
print(f"Percentage of repeating guests who cancel: {percentage_repeating_guest_cancellations:.2f}%")
Percentage of repeating guests who cancel: 1.72%
Observations
- Percentage of repeating guests who cancel is 1.7%
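Because booking_status is encoded as 0/1, the cancellation rate within a group is simply the mean of that column. The sketch below rebuilds one row per booking from the counts in the grouped table above (the names `cells`, `bookings` and `cancel_rate_pct` are illustrative) and recovers the rate for both guest types in one step.

```python
import numpy as np
import pandas as pd

# (booking_status, repeated_guest, number of bookings), copied from the grouped counts
cells = [(0, 0, 23476), (0, 1, 914), (1, 0, 11869), (1, 1, 16)]
sizes = [n for _, _, n in cells]

# Rebuild one row per booking so the usual groupby pattern can be shown
bookings = pd.DataFrame(
    {
        "booking_status": np.repeat([s for s, _, _ in cells], sizes),
        "repeated_guest": np.repeat([g for _, g, _ in cells], sizes),
    }
)

# The mean of a 0/1 column within a group is that group's cancellation rate
cancel_rate_pct = bookings.groupby("repeated_guest")["booking_status"].mean() * 100

print(cancel_rate_pct.round(2))
```

On the notebook's own `data` frame the last statement would apply unchanged.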
Question 6: Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?¶
stacked_barplot(
data,"no_of_special_requests", "booking_status"
) # Code to create stacked barplot of booking status with respect to no. of special requests
booking_status              0      1    All
no_of_special_requests
All                     24390  11885  36275
0                       11232   8545  19777
1                        8670   2703  11373
2                        3727    637   4364
3                         675      0    675
4                          78      0     78
5                           8      0      8
------------------------------------------------------------------------------------------------------------------------
Observations
- We can establish a relationship between the booking status and the number of special requests.
- As the number of special requests increases, the share of bookings canceled decreases.
- Special requirements therefore do affect booking cancellation: guests who make them are less likely to cancel.
Data Preprocessing¶
- Missing value treatment (if needed)
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)
Missing value treatment¶
data.isnull().sum() #Checking for missing values
| | 0 |
|---|---|
| no_of_adults | 0 |
| no_of_children | 0 |
| no_of_weekend_nights | 0 |
| no_of_week_nights | 0 |
| type_of_meal_plan | 0 |
| required_car_parking_space | 0 |
| room_type_reserved | 0 |
| lead_time | 0 |
| arrival_year | 0 |
| arrival_month | 0 |
| arrival_date | 0 |
| market_segment_type | 0 |
| repeated_guest | 0 |
| no_of_previous_cancellations | 0 |
| no_of_previous_bookings_not_canceled | 0 |
| avg_price_per_room | 0 |
| no_of_special_requests | 0 |
| booking_status | 0 |
Observations
- There are no missing values in the dataset.
Outlier Check¶
- Let's check for outliers in the data.
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
# dropping booking_status
numeric_columns.remove("booking_status")
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Observations
There are a couple of outliers. Since they are part of the dataset, they will not be treated.
Model Building¶
Building a Logistic Regression model¶
#Computing different functions to check performance
def model_performance_classification_statsmodels(model, predictors, target, threshold=0.5):
    #Checking which probabilities are greater than the threshold
    pred_temp = model.predict(predictors) > threshold
    #Rounding off the variables
    pred = np.round(pred_temp)
    #Metrics being used for model performance
    acc = accuracy_score(target, pred)  #To compute the accuracy score
    recall = recall_score(target, pred)  #To compute the recall score
    precision = precision_score(target, pred)  #To compute the precision score
    f1 = f1_score(target, pred)  #To compute the f1 score
    #Creating a dataframe for the metrics
    df_perf = pd.DataFrame({'Accuracy': acc, 'Recall': recall, 'Precision': precision, 'F1': f1}, index=[0])
    return df_perf
def model_performance_classification_statsmodels(model, predictors, target, threshold=0.5):
    # Convert predictors to a NumPy array with a suitable data type
    predictors = np.asarray(predictors, dtype=np.float64)
    #Checking which probabilities are greater than the threshold
    pred_temp = model.predict(predictors) > threshold
    #Rounding off the variables
    pred = np.round(pred_temp)
    #Calculating the performance metrics
    Accuracy = accuracy_score(target, pred)
    #print('Accuracy', Accuracy)
    Recall = recall_score(target, pred)
    #print('Recall', Recall)
    Precision = precision_score(target, pred)
    #print('Precision', Precision)
    F1_Score = f1_score(target, pred)
    #print('F1_Score', F1_Score)
    #Creating a dataframe to display the results
    df_perf = pd.DataFrame(
        {
            "Accuracy": [Accuracy],
            "Recall": [Recall],
            "Precision": [Precision],
            "F1_Score": [F1_Score],
        }
    )
    return df_perf
#Computing the confusion matrix
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
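The helper functions above turn predicted probabilities into class labels by comparing them with `threshold`. As a minimal sketch of that step, using only NumPy and made-up labels and probabilities (`y_true`, `proba` and `metrics_at_threshold` are illustrative names, not part of the notebook), the example below shows how lowering the threshold can raise recall at the cost of precision.

```python
import numpy as np

# Hypothetical true labels (1 = canceled) and predicted probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
proba = np.array([0.9, 0.4, 0.6, 0.3, 0.2, 0.7, 0.8, 0.1])


def metrics_at_threshold(y_true, proba, threshold=0.5):
    """Classify as 1 when the probability exceeds the threshold, then score."""
    pred = (proba > threshold).astype(int)
    tp = int(((pred == 1) & (y_true == 1)).sum())  # true positives
    fp = int(((pred == 1) & (y_true == 0)).sum())  # false positives
    fn = int(((pred == 0) & (y_true == 1)).sum())  # false negatives
    tn = int(((pred == 0) & (y_true == 0)).sum())  # true negatives
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"Accuracy": accuracy, "Recall": recall, "Precision": precision, "F1": f1}


default_metrics = metrics_at_threshold(y_true, proba, threshold=0.5)
lower_metrics = metrics_at_threshold(y_true, proba, threshold=0.25)

print(default_metrics)
print(lower_metrics)
```

For a hotel, which threshold is preferable depends on whether missing a cancellation or wrongly flagging a loyal guest is the more costly mistake.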
Data Preparation for Modeling
#Independent and dependent variables defined
x = data.drop(['booking_status'],axis=1)
y = data['booking_status']
#Adding constant
X = sm.add_constant(x)
X = pd.get_dummies(X,drop_first=True)
#Create a train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
print('Shape of the training set:',X_train.shape)
print('Shape of test set:', X_test.shape)
print('Percentage of Classes in Training Set:',y_train.value_counts(normalize=True))
print('Percentage of Classes in Test Set:',y_test.value_counts(normalize=True))
Shape of the training set: (25392, 28)
Shape of test set: (10883, 28)
Percentage of Classes in Training Set: booking_status
0   0.67064
1   0.32936
Name: proportion, dtype: float64
Percentage of Classes in Test Set: booking_status
0   0.67638
1   0.32362
Name: proportion, dtype: float64
# creating dummy variables
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True,
) ## Complete the code to create dummies for independent features
X.head()
| | const | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Meal Plan 3 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 3 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline | market_segment_type_Online |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.00000 | 2 | 0 | 1 | 2 | 0 | 224 | 2017 | 10 | 2 | 0 | 0 | 0 | 65.00000 | 0 | False | False | False | False | False | False | False | False | False | False | False | True | False |
| 1 | 1.00000 | 2 | 0 | 2 | 3 | 0 | 5 | 2018 | 11 | 6 | 0 | 0 | 0 | 106.68000 | 1 | False | False | True | False | False | False | False | False | False | False | False | False | True |
| 2 | 1.00000 | 1 | 0 | 2 | 1 | 0 | 1 | 2018 | 2 | 28 | 0 | 0 | 0 | 60.00000 | 0 | False | False | False | False | False | False | False | False | False | False | False | False | True |
| 3 | 1.00000 | 2 | 0 | 0 | 2 | 0 | 211 | 2018 | 5 | 20 | 0 | 0 | 0 | 100.00000 | 0 | False | False | False | False | False | False | False | False | False | False | False | False | True |
| 4 | 1.00000 | 2 | 0 | 1 | 1 | 0 | 48 | 2018 | 4 | 11 | 0 | 0 | 0 | 94.50000 | 0 | False | False | True | False | False | False | False | False | False | False | False | False | True |
# Converting the input attributes into float type for modeling
X = X.astype(float)
X.head()
| const | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Meal Plan 3 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 3 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline | market_segment_type_Online | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.00000 | 2.00000 | 0.00000 | 1.00000 | 2.00000 | 0.00000 | 224.00000 | 2017.00000 | 10.00000 | 2.00000 | 0.00000 | 0.00000 | 0.00000 | 65.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 |
| 1 | 1.00000 | 2.00000 | 0.00000 | 2.00000 | 3.00000 | 0.00000 | 5.00000 | 2018.00000 | 11.00000 | 6.00000 | 0.00000 | 0.00000 | 0.00000 | 106.68000 | 1.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| 2 | 1.00000 | 1.00000 | 0.00000 | 2.00000 | 1.00000 | 0.00000 | 1.00000 | 2018.00000 | 2.00000 | 28.00000 | 0.00000 | 0.00000 | 0.00000 | 60.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| 3 | 1.00000 | 2.00000 | 0.00000 | 0.00000 | 2.00000 | 0.00000 | 211.00000 | 2018.00000 | 5.00000 | 20.00000 | 0.00000 | 0.00000 | 0.00000 | 100.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| 4 | 1.00000 | 2.00000 | 0.00000 | 1.00000 | 1.00000 | 0.00000 | 48.00000 | 2018.00000 | 4.00000 | 11.00000 | 0.00000 | 0.00000 | 0.00000 | 94.50000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
Building the Model
#Fitting the logistic regression model
logit = sm.Logit(y_train,X_train.astype(float))
lg = logit.fit(disp=False)
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25364
Method: MLE Df Model: 27
Date: Thu, 07 Nov 2024 Pseudo R-squ.: 0.3292
Time: 17:28:26 Log-Likelihood: -10794.
converged: False LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -922.8266 120.832 -7.637 0.000 -1159.653 -686.000
no_of_adults 0.1137 0.038 3.019 0.003 0.040 0.188
no_of_children 0.1580 0.062 2.544 0.011 0.036 0.280
no_of_weekend_nights 0.1067 0.020 5.395 0.000 0.068 0.145
no_of_week_nights 0.0397 0.012 3.235 0.001 0.016 0.064
required_car_parking_space -1.5943 0.138 -11.565 0.000 -1.865 -1.324
lead_time 0.0157 0.000 58.863 0.000 0.015 0.016
arrival_year 0.4561 0.060 7.617 0.000 0.339 0.573
arrival_month -0.0417 0.006 -6.441 0.000 -0.054 -0.029
arrival_date 0.0005 0.002 0.259 0.796 -0.003 0.004
repeated_guest -2.3472 0.617 -3.806 0.000 -3.556 -1.139
no_of_previous_cancellations 0.2664 0.086 3.108 0.002 0.098 0.434
no_of_previous_bookings_not_canceled -0.1727 0.153 -1.131 0.258 -0.472 0.127
avg_price_per_room 0.0188 0.001 25.396 0.000 0.017 0.020
no_of_special_requests -1.4689 0.030 -48.782 0.000 -1.528 -1.410
type_of_meal_plan_Meal Plan 2 0.1756 0.067 2.636 0.008 0.045 0.306
type_of_meal_plan_Meal Plan 3 17.3584 3987.836 0.004 0.997 -7798.656 7833.373
type_of_meal_plan_Not Selected 0.2784 0.053 5.247 0.000 0.174 0.382
room_type_reserved_Room_Type 2 -0.3605 0.131 -2.748 0.006 -0.618 -0.103
room_type_reserved_Room_Type 3 -0.0012 1.310 -0.001 0.999 -2.568 2.566
room_type_reserved_Room_Type 4 -0.2823 0.053 -5.304 0.000 -0.387 -0.178
room_type_reserved_Room_Type 5 -0.7189 0.209 -3.438 0.001 -1.129 -0.309
room_type_reserved_Room_Type 6 -0.9501 0.151 -6.274 0.000 -1.247 -0.653
room_type_reserved_Room_Type 7 -1.4003 0.294 -4.770 0.000 -1.976 -0.825
market_segment_type_Complementary -40.5975 5.65e+05 -7.19e-05 1.000 -1.11e+06 1.11e+06
market_segment_type_Corporate -1.1924 0.266 -4.483 0.000 -1.714 -0.671
market_segment_type_Offline -2.1946 0.255 -8.621 0.000 -2.694 -1.696
market_segment_type_Online -0.3995 0.251 -1.590 0.112 -0.892 0.093
========================================================================================================
Observations
- Lead time, car parking space, repeat guests, special requests, and the room price are the major predictors of cancellation.
- The longer the lead time and higher the room price, the greater the likelihood of cancellation, while repeat guests, guests with special requests, and those requiring parking are more committed to their bookings.
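- However, there are some anomalies: the fit did not converge (converged: False), and type_of_meal_plan_Meal Plan 3 and market_segment_type_Complementary have very large standard errors (about 3988 and 5.65e+05), so their coefficients cannot be estimated reliably. This often points to sparse categories or near-perfect separation.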
- Variables like no_of_adults, no_of_children, no_of_weekend_nights, required_car_parking_space, lead_time, arrival_year, repeated_guest, no_of_previous_cancellations, avg_price_per_room, and no_of_special_requests have statistically significant coefficients, with p-values less than 0.05. This means those features are meaningful predictors of the booking status.
- The pseudo R-squared value of 0.3292 is McFadden's measure: relative to an intercept-only model, the fitted model improves the log-likelihood by about 32.9%. Unlike R-squared in linear regression, it is not the share of variance explained.
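The pseudo R-squared that statsmodels reports for a Logit model is McFadden's: one minus the ratio of the fitted log-likelihood to the null log-likelihood. As a quick check, the figure in the summary can be reproduced from the two log-likelihoods printed there. They are rounded in the summary, so the result should only be expected to agree to about three decimals.

```python
# Log-likelihoods copied from the summary above (rounded as printed)
log_likelihood = -10794.0       # fitted model
log_likelihood_null = -16091.0  # intercept-only model (LL-Null)

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1 - log_likelihood / log_likelihood_null
print(round(pseudo_r2, 4))
```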
#Training Performance
print('Training Performance')
model_performance_classification_statsmodels(lg,X_train,y_train)
Training Performance
| | Accuracy | Recall | Precision | F1_Score |
|---|---|---|---|---|
| 0 | 0.80600 | 0.63410 | 0.73971 | 0.68285 |
Checking Multicollinearity¶
- In order to make statistical inferences from a logistic regression model, it is important to ensure that there is no multicollinearity present in the data.
# Use the Variance Inflation Factor (VIF) to check for multicollinearity
def checking_vif(predictors):
    # Select only numeric columns (boolean dummy columns are excluded by this filter)
    numeric_predictors = predictors.select_dtypes(include=['number'])
    # Drop rows with any NaN values in numeric columns
    numeric_predictors = numeric_predictors.dropna()
    vif = pd.DataFrame()
    vif['Features'] = numeric_predictors.columns
    # Calculating VIF for each feature
    vif['VIF'] = [variance_inflation_factor(numeric_predictors.values, i) for i in range(len(numeric_predictors.columns))]
    return vif

checking_vif(X_train)
| | Features | VIF |
|---|---|---|
| 0 | const | 34866123.37597 |
| 1 | no_of_adults | 1.21461 |
| 2 | no_of_children | 1.17572 |
| 3 | no_of_weekend_nights | 1.05287 |
| 4 | no_of_week_nights | 1.06931 |
| 5 | required_car_parking_space | 1.03415 |
| 6 | lead_time | 1.15845 |
| 7 | arrival_year | 1.26424 |
| 8 | arrival_month | 1.24097 |
| 9 | arrival_date | 1.00486 |
| 10 | repeated_guest | 1.56324 |
| 11 | no_of_previous_cancellations | 1.37579 |
| 12 | no_of_previous_bookings_not_canceled | 1.63420 |
| 13 | avg_price_per_room | 1.39658 |
| 14 | no_of_special_requests | 1.11675 |
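Observation
- All numeric predictors have a VIF below 5, so multicollinearity among them is low and not a concern. The very large VIF of const can be ignored because it is the intercept. The dummy variables do not appear in the table because they are boolean columns and checking_vif keeps only numeric ones.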
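The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. Below is a minimal plain-Python sketch for the two-predictor case, where R² is simply the squared Pearson correlation. The toy values and the helper names `pearson_r` and `vif_two_predictors` are illustrative assumptions and are not part of the notebook.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_predictors(a, b):
    """VIF of either predictor when the model has exactly two predictors plus a constant."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)

# Toy data (hypothetical): nights and price are only loosely related
nights = [1, 2, 3, 4, 5, 6]
price = [90, 70, 110, 95, 120, 85]
# A near-duplicate of `nights`, to show how VIF grows under collinearity
nights_copy = [1.1, 2.0, 2.9, 4.2, 5.0, 5.9]

vif_low = vif_two_predictors(nights, price)
vif_high = vif_two_predictors(nights, nights_copy)
print(f"weakly related predictors: VIF = {vif_low:.2f}")
print(f"almost collinear predictors: VIF = {vif_high:.2f}")
```

With more than two predictors the same formula applies, but R² has to come from a full regression, which is what `variance_inflation_factor` computes.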
# Convert any remaining object columns in X to numeric.
# errors='coerce' replaces non-numeric values with NaN.
for col in X.select_dtypes(include=['object']).columns:
    try:
        X[col] = pd.to_numeric(X[col], errors='coerce')
    except (ValueError, TypeError):
        # If conversion fails, drop the column and print a warning
        print(f"Column '{col}' cannot be converted to numeric and will be dropped.")
        X = X.drop(columns=[col])
# Impute NaN values with the column mean
# (the median or a more sophisticated method may be preferable)
for col in X.select_dtypes(include=np.number).columns:
    X[col] = X[col].fillna(X[col].mean())
# Convert y to a numpy array if it is a pandas Series
if isinstance(y, pd.Series):
    y = y.to_numpy()
# Convert all columns in X to float
X = X.astype(float)
Removing high p-value variables
# Initial list of columns
cols = X_train.columns.tolist()
# Setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
    # Defining the train set, cast to float so statsmodels accepts the boolean dummies
    x_train_aux = X_train[cols].astype(float)
    # Fitting the model
    model = sm.Logit(y_train, x_train_aux).fit(disp=False)
    # Getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)
    # Name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()
    # Drop the least significant variable and refit, until all p-values are at most 0.05
    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break
selected_features = cols
print(selected_features)
['const', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'repeated_guest', 'no_of_previous_cancellations', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Corporate', 'market_segment_type_Offline']
X_train1 = X_train[selected_features]
X_test1 = X_test[selected_features]
New Logit Model
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit(disp=False)
print(lg1.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25370
Method: MLE Df Model: 21
Date: Thu, 07 Nov 2024 Pseudo R-squ.: 0.3282
Time: 17:28:29 Log-Likelihood: -10810.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const -915.6391 120.471 -7.600 0.000 -1151.758 -679.520
no_of_adults 0.1088 0.037 2.914 0.004 0.036 0.182
no_of_children 0.1531 0.062 2.470 0.014 0.032 0.275
no_of_weekend_nights 0.1086 0.020 5.498 0.000 0.070 0.147
no_of_week_nights 0.0417 0.012 3.399 0.001 0.018 0.066
required_car_parking_space -1.5947 0.138 -11.564 0.000 -1.865 -1.324
lead_time 0.0157 0.000 59.213 0.000 0.015 0.016
arrival_year 0.4523 0.060 7.576 0.000 0.335 0.569
arrival_month -0.0425 0.006 -6.591 0.000 -0.055 -0.030
repeated_guest -2.7367 0.557 -4.916 0.000 -3.828 -1.646
no_of_previous_cancellations 0.2288 0.077 2.983 0.003 0.078 0.379
avg_price_per_room 0.0192 0.001 26.336 0.000 0.018 0.021
no_of_special_requests -1.4698 0.030 -48.884 0.000 -1.529 -1.411
type_of_meal_plan_Meal Plan 2 0.1642 0.067 2.469 0.014 0.034 0.295
type_of_meal_plan_Not Selected 0.2860 0.053 5.406 0.000 0.182 0.390
room_type_reserved_Room_Type 2 -0.3552 0.131 -2.709 0.007 -0.612 -0.098
room_type_reserved_Room_Type 4 -0.2828 0.053 -5.330 0.000 -0.387 -0.179
room_type_reserved_Room_Type 5 -0.7364 0.208 -3.535 0.000 -1.145 -0.328
room_type_reserved_Room_Type 6 -0.9682 0.151 -6.403 0.000 -1.265 -0.672
room_type_reserved_Room_Type 7 -1.4343 0.293 -4.892 0.000 -2.009 -0.860
market_segment_type_Corporate -0.7913 0.103 -7.692 0.000 -0.993 -0.590
market_segment_type_Offline -1.7854 0.052 -34.363 0.000 -1.887 -1.684
==================================================================================================
print('Training Performance')
model_performance_classification_statsmodels(lg1,X_train1,y_train)
Training Performance
| | Accuracy | Recall | Precision | F1_Score |
|---|---|---|---|---|
| 0 | 0.80545 | 0.63267 | 0.73907 | 0.68174 |
Observations
The F1 score changed only slightly, from 0.68285 to 0.68174, so dropping the variables with high p-values did not hurt the performance of the logistic regression.
Converting coefficients to odds¶
- The coefficients of the logistic regression model are in terms of log(odds); to find the odds we have to take the exponential of the coefficients.
- Therefore, odds = exp(b)
- The percentage change in odds is given as (exp(b) - 1) * 100
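As a worked example of these formulas, the sketch below applies them to two coefficients copied (rounded) from the lg1 summary. It also shows that a change of several units multiplies the odds by exp(k * b); the 30-day figure is an illustration and is not a result reported in the notebook.

```python
import math

# Coefficients copied from the lg1 summary above (rounded as printed)
coef_lead_time = 0.0157
coef_repeated_guest = -2.7367

# Odds ratio for a one-unit increase: exp(b)
odds_lead_time = math.exp(coef_lead_time)
# Percentage change in odds: (exp(b) - 1) * 100
pct_lead_time = (odds_lead_time - 1) * 100
pct_repeated = (math.exp(coef_repeated_guest) - 1) * 100

# A k-unit increase multiplies the odds by exp(k * b), not by k * exp(b)
odds_30_days = math.exp(30 * coef_lead_time)

print(f"lead_time, +1 day: {pct_lead_time:.2f}% change in odds")
print(f"lead_time, +30 days: odds multiplied by {odds_30_days:.2f}")
print(f"repeated_guest: {pct_repeated:.2f}% change in odds")
```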
#Converting coefficients to odds
odds = np.exp(lg1.params.astype(np.float64)) # Convert lg1.params to float64
#Finding the percentage change
perc_change_odds = (np.exp(lg1.params.astype(np.float64)) - 1) * 100 # Convert lg1.params to float64
#Removing limit from number of columns to display
pd.set_option("display.max_columns", None)
#Adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train1.columns).T
| const | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | repeated_guest | no_of_previous_cancellations | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Corporate | market_segment_type_Offline | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Odds | 0.00000 | 1.11491 | 1.16546 | 1.11470 | 1.04258 | 0.20296 | 1.01583 | 1.57195 | 0.95839 | 0.06478 | 1.25712 | 1.01937 | 0.22996 | 1.17846 | 1.33109 | 0.70104 | 0.75364 | 0.47885 | 0.37977 | 0.23827 | 0.45326 | 0.16773 |
| Change_odd% | -100.00000 | 11.49096 | 16.54593 | 11.46966 | 4.25841 | -79.70395 | 1.58331 | 57.19508 | -4.16120 | -93.52180 | 25.71181 | 1.93684 | -77.00374 | 17.84641 | 33.10947 | -29.89588 | -24.63551 | -52.11548 | -62.02290 | -76.17294 | -54.67373 | -83.22724 |
Observations
Holding the other features constant, a one-unit increase in a feature (or membership in a dummy category) changes the odds of cancellation as follows:
- Increases: no. of adults by 11.5%, no. of children by 16.5%, no. of weekend nights by 11.5%, no. of week nights by 4.3%, lead time by 1.6% per day, arrival year by 57.2%, no. of previous cancellations by 25.7%, and average price per room by 1.9%.
- Decreases: required car parking space by 79.7%, arrival month by 4.2%, repeated guest by 93.5%, and no. of special requests by 77.0%.
- Relative to the baseline meal plan, Meal Plan 2 raises the odds by 17.8% and Not Selected by 33.1%.
- All retained room types and both retained market segments (Corporate and Offline) lower the odds relative to their baseline categories.
Model performance evaluation¶
# creating confusion matrix
# Convert X_train1 columns to numeric type before prediction
X_train1_numeric = X_train1.astype(float) #Convert all columns to float64
confusion_matrix_statsmodels(lg1, X_train1_numeric, y_train)
log_model_train_perf = model_performance_classification_statsmodels(lg1,X_train1,y_train)
ROC-AUC¶
- ROC-AUC on training set
# Before prediction, ensure all columns are numeric and handle potential non-numeric values
X_train1 = X_train1.apply(pd.to_numeric, errors='coerce').fillna(0)
# Convert X_train1 columns to numeric type before prediction
X_train1_numeric = X_train1.astype(np.float64) # Use np.float64 for consistency
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(X_train1_numeric)) # Pass X_train1_numeric to predict
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1_numeric)) # Pass X_train1_numeric to predict
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
Observations
The model's performance is good.
Optimal threshold using AUC-ROC curve¶
# Before prediction, ensure all columns are numeric and handle potential non-numeric values
X_train1 = X_train1.apply(pd.to_numeric, errors='coerce').fillna(0)
# Convert X_train1 columns to numeric type before prediction
# Explicitly cast all columns to float64
X_train1 = X_train1.astype(np.float64)
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.3700522558708252
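The `np.argmax(tpr - fpr)` rule used above is Youden's J statistic: it picks the cut-off where the gap between the true positive rate and the false positive rate is largest. Here is a plain-Python sketch of the same idea on a handful of made-up scores. The data and the helper `youden_threshold` are hypothetical and do not reproduce the internals of scikit-learn's `roc_curve`.

```python
def youden_threshold(y_true, scores):
    """Return the score cut-off that maximizes TPR - FPR (Youden's J)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_j, best_threshold = -1.0, None
    # Every distinct score is a candidate cut-off; predict 1 when score >= cut-off
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_j, best_threshold = j, threshold
    return best_threshold, best_j

# Toy labels (1 = canceled) and predicted probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

threshold, j = youden_threshold(y_true, scores)
print(threshold, round(j, 3))
```

Because this criterion weights TPR and FPR equally, it tends to choose a lower threshold than 0.5 when the positive class is the minority, as it does for the booking data.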
Confusion matrix using 0.37 as threshold
confusion_matrix_statsmodels(lg1,X_train1,y_train,threshold=optimal_threshold_auc_roc)
Observations
With the 0.37 threshold the model captures more of the actual cancellations (more true positives) at the cost of more false positives, so it tends to overpredict cancellations slightly.
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
| | Accuracy | Recall | Precision | F1_Score |
|---|---|---|---|---|
| 0 | 0.79265 | 0.73622 | 0.66808 | 0.70049 |
Observations
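Compared with the default threshold, the F1 Score improved from 0.68174 to 0.70049: recall rose from 0.63267 to 0.73622, while precision fell from 0.73907 to 0.66808.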
Using the Precision-Recall curve to see if we can find a better threshold¶
y_scores = lg1.predict(X_train1)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])

plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
# setting the threshold
optimal_threshold_curve = 0.42
Observations
Precision and recall are balanced at a threshold of about 0.42, where the two curves cross.
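One way to locate such a balance point numerically, rather than reading it off the plot, is to scan the candidate thresholds for the one where precision and recall are closest. The following is a small plain-Python sketch on made-up data. The helpers `precision_recall_at` and `balanced_threshold` are hypothetical names, and the toy scores are not the notebook's predictions.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting 1 for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def balanced_threshold(y_true, scores):
    """Candidate cut-off where |precision - recall| is smallest."""
    def gap(threshold):
        precision, recall = precision_recall_at(y_true, scores, threshold)
        return abs(precision - recall)
    return min(sorted(set(scores)), key=gap)

# Toy labels (1 = canceled) and predicted probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

best = balanced_threshold(y_true, scores)
print(best, precision_recall_at(y_true, scores, best))
```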
Confusion matrix using threshold of 0.42
# setting the threshold
optimal_threshold_curve = 0.42
# Use optimal_threshold_curve as the threshold
confusion_matrix_statsmodels(lg1, X_train1, y_train, threshold=optimal_threshold_curve)
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
| | Accuracy | Recall | Precision | F1_Score |
|---|---|---|---|---|
| 0 | 0.80132 | 0.69939 | 0.69797 | 0.69868 |
Observations
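There is a slight drop in the F1 score compared with the 0.37 threshold (0.69868 vs 0.70049), but it remains above the default-threshold F1 score of 0.68174.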
Checking the performance on the test set using the default threshold
# Cast the test predictors to float so they match the training data
X_test1 = X_test1.astype(float)
# Confusion matrix on the test set
confusion_matrix_statsmodels(lg1, X_test1, y_test)
#Metrics
log_reg_model_test_perf = model_performance_classification_statsmodels(lg1,X_test1,y_test)
print("Test performance:")
log_reg_model_test_perf
Test performance:
| | Accuracy | Recall | Precision | F1_Score |
|---|---|---|---|---|
| 0 | 0.80465 | 0.63089 | 0.72900 | 0.67641 |
Observations
On the test set the model reaches an accuracy of 80.47%, so it predicts the booking status correctly for most bookings. Recall is 63.09%, meaning the model misses more than a third of the actual cancellations. Precision is 72.90%, meaning that roughly 73% of the bookings it flags as cancellations are in fact canceled.
The F1 score is 67.64%, a satisfactory but far from excellent balance between precision and recall. Overall the model performs reasonably, though recall would need to improve for it to catch more cancellations.
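The F1 score is the harmonic mean of precision and recall, so the reported value can be checked directly from the other two columns of the table. A short sketch using the rounded test-set figures:

```python
# Test-set precision and recall at the default threshold, from the table above
precision = 0.72900
recall = 0.63089

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, the lower recall is what holds the F1 score down here.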
- ROC curve on test set
logit_roc_auc_test = roc_auc_score(y_test, lg1.predict(X_test1))
fpr, tpr, thresholds = roc_curve(y_test, lg1.predict(X_test1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
Observations
- With a ROC-AUC of 0.86, the model is good at distinguishing between canceled and non-canceled bookings.
- There is a strong balance between true positive rate and false positive rate.
- The model is reliable for predicting booking cancellations.
Using model with threshold=0.37
confusion_matrix_statsmodels(lg1,X_test1,y_test,threshold=optimal_threshold_auc_roc)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Test performance:
| | Accuracy | Recall | Precision | F1_Score |
|---|---|---|---|---|
| 0 | 0.79555 | 0.73964 | 0.66573 | 0.70074 |
Observations
- The model with 0.37 threshold is more sensitive to booking cancellations than the model with threshold of 0.50.
Using model with threshold=0.42
# Get predicted probabilities
y_pred_prob = lg1.predict(X_test1)
# Calculate precision and recall for different thresholds
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
# Note: this cell does not reuse optimal_threshold_curve (0.42) from the training set.
# It picks the threshold that maximizes F1 on the test set itself, which makes the
# resulting test metrics optimistic.
# precision and recall have one more element than thresholds, so drop the last point
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1])
optimal_threshold_recall_precision = thresholds[np.argmax(f1_scores)]
# Confusion matrix at that threshold
confusion_matrix_statsmodels(lg1, X_test1, y_test, optimal_threshold_recall_precision)
#Metrics
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_test1, y_test, threshold=optimal_threshold_recall_precision
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
| | Accuracy | Recall | Precision | F1_Score |
|---|---|---|---|---|
| 0 | 0.79224 | 0.75923 | 0.65427 | 0.70285 |
Observations
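- This threshold gives the highest test recall of the three settings (75.9%) and the lowest precision (65.4%), so it captures the most cancellations. Because the threshold was chosen by maximizing F1 on the test set, rather than fixed at the 0.42 found on the training set, these figures are not directly comparable with the training results for 0.42.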
Final Model Summary¶
# Assuming 'model_performance_classification_statsmodels' is a defined function
# and 'X_train1', 'y_train' are your training data
log_reg_model_train_perf = model_performance_classification_statsmodels(lg1, X_train1, y_train)
#Model Comparison Training Set
models_train_comp_df = pd.concat(
[
log_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Logistic Regression-default Threshold | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|
| Accuracy | 0.80545 | 0.79265 | 0.80132 |
| Recall | 0.63267 | 0.73622 | 0.69939 |
| Precision | 0.73907 | 0.66808 | 0.69797 |
| F1_Score | 0.68174 | 0.70049 | 0.69868 |
Observations
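- Training and test metrics are close at each threshold (for example, an F1 score of 0.68174 on training vs 0.67641 on test at the default threshold), so there is no sign of overfitting.
- The three thresholds give similar training F1 Scores, between 0.68 and 0.70; lowering the threshold trades precision for recall.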
Building a Decision Tree model¶
Data Preparation for modeling (Decision Tree)
- We want to predict which bookings will be canceled.
- Before we proceed to build a model, we'll have to encode categorical features.
- We'll split the data into train and test to be able to evaluate the model that we build on the train data.
#Creating independent and dependent variables
X = data.drop(['booking_status'],axis=1)
Y = data['booking_status']
#Create dummy variables
X = pd.get_dummies(X, drop_first=True)
#Splitting for training and test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:",y_train.value_counts(normalize=True))
print("Percentage of classes in test set:",y_test.value_counts(normalize=True))
Shape of Training set : (25392, 27)
Shape of test set : (10883, 27)
Percentage of classes in training set: booking_status
0    0.67064
1    0.32936
Name: proportion, dtype: float64
Percentage of classes in test set: booking_status
0    0.67638
1    0.32362
Name: proportion, dtype: float64
Functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.¶
- The model_performance_classification_sklearn function will be used to check the model performance of models.
- The confusion_matrix_sklearn function will be used to plot the confusion matrix.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    # each cell label shows the count and its share of all observations
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
Building Decision Tree Model¶
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train,y_train)
DecisionTreeClassifier(random_state=1)
Checking model performance on training set¶
confusion_matrix_sklearn(model,X_train,y_train)
decision_tree_perf_train_default = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train_default
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.99421 | 0.98661 | 0.99578 | 0.99117 |
Observations
- The model correctly identifies almost all class 0 bookings; True Negatives of 16994.
- Performance on the training set is near perfect, which is expected from an unrestricted tree that can keep splitting until it fits the training data.
- The F1 Score is almost perfect: 99.1%.
Checking model performance on test set¶
confusion_matrix_sklearn(model,X_test,y_test)
decision_tree_perf_test_default = model_performance_classification_sklearn(model,X_test,y_test)
decision_tree_perf_test_default
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.87118 | 0.81175 | 0.79461 | 0.80309 |
Observations
- The F1 Score on the test set is 80%, well below the 99% obtained on the training set.
- This gap indicates that the unpruned tree is overfitting the training data.
Before pruning the tree let's check the important features.
feature_names = list(X_train.columns)
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations
- Lead time is the most important predictor of booking cancellation by guests
Do we need to prune the tree?¶
Pre-Pruning
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 7, 2),
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(f1_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
min_samples_split=10, random_state=1)
Checking performance on training set¶
confusion_matrix_sklearn(estimator,X_train,y_train)
decision_tree_tune_perf_train = model_performance_classification_sklearn(estimator,X_train,y_train)
decision_tree_tune_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.83097 | 0.78608 | 0.72425 | 0.75390 |
Observations
- Pre-pruning has reduced the F1 Score on the training set from 99% to 75%.
- The model works reasonably well, but with recall at 78.6% and precision at 72.4% there is still room to reduce both false negatives and false positives.
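The tuned tree was built with class_weight="balanced". In scikit-learn this sets each class weight to n_samples / (n_classes * class_count), so the minority class (cancellations) counts for more when splits are evaluated. The sketch below works through that arithmetic with the training class proportions printed earlier. The class counts are reconstructed from the rounded proportions, so they should be treated as approximate.

```python
n_samples = 25392
# Class proportions from y_train.value_counts(normalize=True)
proportions = {0: 0.67064, 1: 0.32936}
n_classes = len(proportions)

# Approximate class counts recovered from the rounded proportions
counts = {label: round(p * n_samples) for label, p in proportions.items()}

# "balanced" weight for each class: n_samples / (n_classes * count)
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

print(counts)
print({label: round(w, 3) for label, w in weights.items()})
# After weighting, both classes carry the same total weight
print({label: round(weights[label] * counts[label], 1) for label in counts})
```

This is why the leaf weights in the text report of the tree are fractional rather than whole sample counts.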
Checking performance on test set¶
decision_tree_tune_perf_test = model_performance_classification_sklearn(estimator,X_test,y_test)
decision_tree_tune_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.83497 | 0.78336 | 0.72758 | 0.75444 |
Observations
- On the test set, pre-pruning has reduced the F1 score from 80% to 75%. Training and test F1 are now nearly identical (about 0.75 each), an indication that overfitting has been reduced.
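The reasoning above (a smaller gap between training and test F1 signals less overfitting) can be sketched on synthetic data. This is a minimal illustration, not the notebook's data: `make_classification` stands in for the INN Hotels dataset, and the helper `f1_gap` and the choice of `max_depth=4` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the INN Hotels dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5, flip_y=0.1, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)


def f1_gap(model):
    """Fit the model and return (train F1, test F1, train minus test)."""
    model.fit(X_tr, y_tr)
    f1_tr = f1_score(y_tr, model.predict(X_tr))
    f1_te = f1_score(y_te, model.predict(X_te))
    return f1_tr, f1_te, f1_tr - f1_te


# Unpruned tree versus a depth-limited (pre-pruned) tree
full_tree = DecisionTreeClassifier(random_state=1, class_weight="balanced")
shallow_tree = DecisionTreeClassifier(
    random_state=1, class_weight="balanced", max_depth=4
)

gap_full = f1_gap(full_tree)
gap_shallow = f1_gap(shallow_tree)
print("unpruned    (train, test, gap):", gap_full)
print("max_depth=4 (train, test, gap):", gap_shallow)
```

A large positive gap for the unpruned tree alongside a smaller gap for the depth-limited tree would mirror the pattern observed in this notebook.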
Visualizing the Decision Tree¶
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 196.50
|   |   |   |   |   |   |--- weights: [1736.39, 133.59] class: 0
|   |   |   |   |   |--- avg_price_per_room > 196.50
|   |   |   |   |   |   |--- weights: [0.75, 24.29] class: 1
|   |   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |--- weights: [960.27, 223.16] class: 0
|   |   |   |   |   |--- lead_time > 68.50
|   |   |   |   |   |   |--- weights: [129.73, 160.92] class: 1
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- weights: [214.72, 227.72] class: 1
|   |   |   |   |   |--- avg_price_per_room > 93.58
|   |   |   |   |   |   |--- weights: [82.76, 285.41] class: 1
|   |   |   |   |--- lead_time > 117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- weights: [87.23, 81.98] class: 0
|   |   |   |   |   |--- no_of_week_nights > 1.50
|   |   |   |   |   |   |--- weights: [228.14, 48.58] class: 0
|   |   |--- market_segment_type_Online > 0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [92.45, 0.00] class: 0
|   |   |   |   |   |--- arrival_month > 1.50
|   |   |   |   |   |   |--- weights: [363.83, 132.08] class: 0
|   |   |   |   |--- avg_price_per_room > 99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- weights: [219.94, 85.01] class: 0
|   |   |   |   |   |--- lead_time > 3.50
|   |   |   |   |   |   |--- weights: [132.71, 280.85] class: 1
|   |   |   |--- lead_time > 13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- weights: [158.80, 159.40] class: 1
|   |   |   |   |   |--- avg_price_per_room > 71.92
|   |   |   |   |   |   |--- weights: [850.67, 3543.28] class: 1
|   |   |   |   |--- required_car_parking_space > 0.50
|   |   |   |   |   |--- weights: [48.46, 1.52] class: 0
|   |--- no_of_special_requests > 0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 102.50
|   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |--- weights: [697.09, 9.11] class: 0
|   |   |   |   |   |--- type_of_meal_plan_Not Selected > 0.50
|   |   |   |   |   |   |--- weights: [15.66, 9.11] class: 0
|   |   |   |   |--- lead_time > 102.50
|   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |--- weights: [32.06, 19.74] class: 0
|   |   |   |   |   |--- no_of_week_nights > 2.50
|   |   |   |   |   |   |--- weights: [44.73, 3.04] class: 0
|   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |   |--- weights: [498.03, 44.03] class: 0
|   |   |   |   |   |--- lead_time > 4.50
|   |   |   |   |   |   |--- weights: [258.71, 63.76] class: 0
|   |   |   |   |--- lead_time > 8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- weights: [2512.51, 1451.32] class: 0
|   |   |   |   |   |--- required_car_parking_space > 0.50
|   |   |   |   |   |   |--- weights: [134.20, 1.52] class: 0
|   |   |--- no_of_special_requests > 1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [1585.04, 0.00] class: 0
|   |   |   |   |--- no_of_week_nights > 3.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [180.42, 57.69] class: 0
|   |   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |   |--- weights: [52.19, 0.00] class: 0
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- weights: [184.90, 56.17] class: 0
|   |   |   |   |   |--- arrival_month > 8.50
|   |   |   |   |   |   |--- weights: [106.61, 106.27] class: 0
|   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |--- weights: [67.10, 0.00] class: 0
|--- lead_time > 151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- weights: [3.73, 24.29] class: 1
|   |   |   |   |   |--- lead_time > 163.50
|   |   |   |   |   |   |--- weights: [257.96, 62.24] class: 0
|   |   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- weights: [8.95, 3.04] class: 0
|   |   |   |   |   |--- avg_price_per_room > 2.50
|   |   |   |   |   |   |--- weights: [0.75, 97.16] class: 1
|   |   |   |--- no_of_adults > 1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [2.98, 282.37] class: 1
|   |   |   |   |   |--- market_segment_type_Offline > 0.50
|   |   |   |   |   |   |--- weights: [213.97, 385.60] class: 1
|   |   |   |   |--- avg_price_per_room > 82.47
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- weights: [23.86, 1030.80] class: 1
|   |   |   |   |   |--- no_of_adults > 2.50
|   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |--- no_of_special_requests > 0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- lead_time <= 159.50
|   |   |   |   |   |   |--- weights: [7.46, 7.59] class: 1
|   |   |   |   |   |--- lead_time > 159.50
|   |   |   |   |   |   |--- weights: [37.28, 4.55] class: 0
|   |   |   |   |--- lead_time > 180.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [20.13, 212.54] class: 1
|   |   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- weights: [231.12, 110.82] class: 0
|   |   |   |   |   |--- arrival_month > 11.50
|   |   |   |   |   |   |--- weights: [19.38, 34.92] class: 1
|   |   |   |   |--- market_segment_type_Offline > 0.50
|   |   |   |   |   |--- lead_time <= 348.50
|   |   |   |   |   |   |--- weights: [106.61, 3.04] class: 0
|   |   |   |   |   |--- lead_time > 348.50
|   |   |   |   |   |   |--- weights: [5.96, 4.55] class: 0
|   |--- avg_price_per_room > 100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [0.00, 3200.19] class: 1
|   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |--- arrival_month > 11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- weights: [35.04, 0.00] class: 0
|   |   |   |--- no_of_special_requests > 0.50
|   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |--- arrival_date > 24.50
|   |   |   |   |   |--- weights: [3.73, 22.77] class: 1
# importance of features in the tree building
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations
- Before pre-pruning, lead time was the most important predictor of booking cancellation by guests, followed by average price per room. After pre-pruning, lead time is still the most important predictor, and the online market segment type is now the second most important.
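The ranking read off the bar chart can also be inspected numerically. The sketch below is a minimal illustration on synthetic data: the feature names in `demo_names` and the fitted `demo_tree` are hypothetical stand-ins for `feature_names` and `estimator` in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical feature names (assumptions)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=1
)
demo_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

demo_tree = DecisionTreeClassifier(random_state=1, max_depth=4).fit(X_demo, y_demo)

# Pair each importance with its feature name and sort from most to least important
importance_table = pd.Series(
    demo_tree.feature_importances_, index=demo_names
).sort_values(ascending=False)
print(importance_table)
```

In the notebook itself, the same pattern would use `estimator.feature_importances_` with `feature_names` as the index.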
Cost Complexity Pruning
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.00000 | 0.00838 |
| 1 | 0.00000 | 0.00838 |
| 2 | 0.00000 | 0.00838 |
| 3 | 0.00000 | 0.00838 |
| 4 | 0.00000 | 0.00838 |
| ... | ... | ... |
| 1839 | 0.00890 | 0.32806 |
| 1840 | 0.00980 | 0.33786 |
| 1841 | 0.01272 | 0.35058 |
| 1842 | 0.03412 | 0.41882 |
| 1843 | 0.08118 | 0.50000 |
1844 rows × 2 columns
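For reference, the two columns above follow the definitions used in scikit-learn's description of minimal cost-complexity pruning. The notation below is taken from that description rather than from this notebook: $R(T)$ is the total sample-weighted impurity of the leaves of tree $T$, $|\widetilde{T}|$ is its number of leaves, and $T_t$ is the branch rooted at node $t$.

$$
R_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|,
\qquad
\alpha_{\text{eff}}(t) = \frac{R(t) - R(T_t)}{|\widetilde{T}_t| - 1}
$$

The node with the smallest effective alpha is pruned first, so each row of the table gives an alpha at which the tree shrinks and the total leaf impurity $R(T)$ of the resulting subtree. The last row, with impurity 0.50000, corresponds to the tree pruned down to a single node.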
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)  # fit the decision tree on the training data
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.0811791438913696
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
F1 Score vs alpha for training and testing sets¶
f1_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = f1_score(y_train, pred_train)
    f1_train.append(values_train)
f1_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = f1_score(y_test, pred_test)
    f1_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.00012267633155167043,
class_weight='balanced', random_state=1)
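The model above is chosen by maximizing F1 on the test set, which means the test set also influences model selection. One possible alternative, sketched here on synthetic data, is to choose `ccp_alpha` by cross-validation on the training data only. Everything in this block (the stand-in data, `candidate_alphas`, taking every fifth alpha to keep the grid small) is an assumption for illustration, not the notebook's method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the INN Hotels dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4, flip_y=0.1, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

base = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = base.cost_complexity_pruning_path(X_tr, y_tr)

# Drop the last alpha (single-node tree), clip round-off negatives to zero,
# remove duplicates, and keep every fifth value to limit the number of fits
candidate_alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))[::5]

# Choose ccp_alpha by 5-fold cross-validated F1 on the training data only
search = GridSearchCV(base, {"ccp_alpha": candidate_alphas}, scoring="f1", cv=5)
search.fit(X_tr, y_tr)

best_alpha_cv = search.best_params_["ccp_alpha"]
test_f1 = search.score(X_te, y_te)  # test set is used once, for the final estimate
print("best ccp_alpha:", best_alpha_cv, "test F1:", test_f1)
```

With this approach the test-set F1 remains an estimate of performance on unseen bookings, since the test data plays no part in choosing the pruning strength.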
Checking performance on training set¶
confusion_matrix_sklearn(best_model, X_train, y_train)
decision_tree_post_perf_train = model_performance_classification_sklearn(
    best_model, X_train, y_train
)
decision_tree_post_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.89954 | 0.90303 | 0.81274 | 0.85551 |
Observations
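- Training F1 after cost complexity pruning is 0.86, well above the 0.75 of the pre-pruned tree and below the 99% of the unpruned tree.
- The drop from the unpruned tree's near-perfect training score suggests that overfitting has been reduced.
- Precision (0.81) and recall (0.90) are both higher than for the pre-pruned tree (0.72 and 0.79).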
Checking performance on test set¶
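# Confusion matrix
confusion_matrix_sklearn(best_model, X_test, y_test)
# Metrics on the test set (the original call passed X_train, y_train by mistake)
decision_tree_test = model_performance_classification_sklearn(best_model, X_test, y_test)
decision_tree_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.89954 | 0.90303 | 0.81274 | 0.85551 |
Observations
- The values shown above are identical to the training-set metrics because they were produced by a call that evaluated the model on X_train and y_train; the cell must be rerun on X_test and y_test to obtain the test-set metrics.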
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50 | |--- no_of_special_requests <= 0.50 | | |--- market_segment_type_Online <= 0.50 | | | |--- lead_time <= 90.50 | | | | |--- no_of_weekend_nights <= 0.50 | | | | | |--- avg_price_per_room <= 196.50 | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | |--- lead_time <= 16.50 | | | | | | | | |--- avg_price_per_room <= 68.50 | | | | | | | | | |--- weights: [207.26, 10.63] class: 0 | | | | | | | | |--- avg_price_per_room > 68.50 | | | | | | | | | |--- arrival_date <= 29.50 | | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | |--- arrival_date > 29.50 | | | | | | | | | | |--- weights: [2.24, 7.59] class: 1 | | | | | | | |--- lead_time > 16.50 | | | | | | | | |--- avg_price_per_room <= 135.00 | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | |--- no_of_previous_bookings_not_canceled <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- no_of_previous_bookings_not_canceled > 0.50 | | | | | | | | | | | |--- weights: [11.18, 0.00] class: 0 | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | |--- weights: [21.62, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 135.00 | | | | | | | | | |--- weights: [0.00, 12.14] class: 1 | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | |--- weights: [1199.59, 1.52] class: 0 | | | | | |--- avg_price_per_room > 196.50 | | | | | | |--- weights: [0.75, 24.29] class: 1 | | | | |--- no_of_weekend_nights > 0.50 | | | | | |--- lead_time <= 68.50 | | | | | | |--- arrival_month <= 9.50 | | | | | | | |--- avg_price_per_room <= 63.29 | | | | | | | | |--- arrival_date <= 20.50 | | | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | | | |--- weights: [41.75, 0.00] class: 0 | | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | 
| | | | |--- weights: [0.75, 3.04] class: 1 | | | | | | | | |--- arrival_date > 20.50 | | | | | | | | | |--- avg_price_per_room <= 59.75 | | | | | | | | | | |--- arrival_date <= 23.50 | | | | | | | | | | | |--- weights: [1.49, 12.14] class: 1 | | | | | | | | | | |--- arrival_date > 23.50 | | | | | | | | | | | |--- weights: [14.91, 1.52] class: 0 | | | | | | | | | |--- avg_price_per_room > 59.75 | | | | | | | | | | |--- lead_time <= 44.00 | | | | | | | | | | | |--- weights: [0.75, 59.21] class: 1 | | | | | | | | | | |--- lead_time > 44.00 | | | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 63.29 | | | | | | | | |--- no_of_weekend_nights <= 3.50 | | | | | | | | | |--- lead_time <= 59.50 | | | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- lead_time > 59.50 | | | | | | | | | | |--- arrival_month <= 5.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- arrival_month > 5.50 | | | | | | | | | | | |--- weights: [20.13, 0.00] class: 0 | | | | | | | | |--- no_of_weekend_nights > 3.50 | | | | | | | | | |--- weights: [0.75, 15.18] class: 1 | | | | | | |--- arrival_month > 9.50 | | | | | | | |--- weights: [413.04, 27.33] class: 0 | | | | | |--- lead_time > 68.50 | | | | | | |--- avg_price_per_room <= 99.98 | | | | | | | |--- arrival_month <= 3.50 | | | | | | | | |--- avg_price_per_room <= 62.50 | | | | | | | | | |--- weights: [15.66, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 62.50 | | | | | | | | | |--- avg_price_per_room <= 80.38 | | | | | | | | | | |--- weights: [8.20, 25.81] class: 1 | | | | | | | | | |--- avg_price_per_room > 80.38 | | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | |--- arrival_month > 3.50 | | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | | |--- weights: [55.17, 
3.04] class: 0 | | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | | |--- lead_time <= 73.50 | | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | | | |--- lead_time > 73.50 | | | | | | | | | | |--- weights: [21.62, 4.55] class: 0 | | | | | | |--- avg_price_per_room > 99.98 | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | |--- weights: [8.95, 0.00] class: 0 | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | |--- avg_price_per_room <= 132.43 | | | | | | | | | |--- weights: [9.69, 122.97] class: 1 | | | | | | | | |--- avg_price_per_room > 132.43 | | | | | | | | | |--- weights: [6.71, 0.00] class: 0 | | | |--- lead_time > 90.50 | | | | |--- lead_time <= 117.50 | | | | | |--- avg_price_per_room <= 93.58 | | | | | | |--- avg_price_per_room <= 75.07 | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | |--- avg_price_per_room <= 58.75 | | | | | | | | | |--- weights: [5.96, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 58.75 | | | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | | | |--- weights: [4.47, 0.00] class: 0 | | | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | | | |--- arrival_month <= 4.50 | | | | | | | | | | | |--- weights: [2.24, 118.41] class: 1 | | | | | | | | | | |--- arrival_month > 4.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | |--- arrival_date <= 11.50 | | | | | | | | | |--- weights: [31.31, 0.00] class: 0 | | | | | | | | |--- arrival_date > 11.50 | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | |--- weights: [23.11, 6.07] class: 0 | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | |--- weights: [5.96, 9.11] class: 1 | | | | | | |--- avg_price_per_room > 75.07 | | | | | | | |--- arrival_month <= 3.50 | | | | | | | | |--- weights: [59.64, 3.04] class: 0 | | | | | | | |--- arrival_month > 3.50 | | | | | | | | |--- arrival_month <= 4.50 | | 
| | | | | | | |--- weights: [1.49, 16.70] class: 1 | | | | | | | | |--- arrival_month > 4.50 | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | |--- avg_price_per_room <= 86.00 | | | | | | | | | | | |--- weights: [2.24, 16.70] class: 1 | | | | | | | | | | |--- avg_price_per_room > 86.00 | | | | | | | | | | | |--- weights: [8.95, 3.04] class: 0 | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | |--- arrival_date <= 22.50 | | | | | | | | | | | |--- weights: [44.73, 4.55] class: 0 | | | | | | | | | | |--- arrival_date > 22.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | |--- avg_price_per_room > 93.58 | | | | | | |--- arrival_date <= 11.50 | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | |--- weights: [16.40, 39.47] class: 1 | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | |--- weights: [20.13, 6.07] class: 0 | | | | | | |--- arrival_date > 11.50 | | | | | | | |--- avg_price_per_room <= 102.09 | | | | | | | | |--- weights: [5.22, 144.22] class: 1 | | | | | | | |--- avg_price_per_room > 102.09 | | | | | | | | |--- avg_price_per_room <= 109.50 | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | |--- weights: [0.75, 16.70] class: 1 | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | |--- weights: [33.55, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 109.50 | | | | | | | | | |--- avg_price_per_room <= 124.25 | | | | | | | | | | |--- weights: [2.98, 75.91] class: 1 | | | | | | | | | |--- avg_price_per_room > 124.25 | | | | | | | | | | |--- weights: [3.73, 3.04] class: 0 | | | | |--- lead_time > 117.50 | | | | | |--- no_of_week_nights <= 1.50 | | | | | | |--- arrival_date <= 7.50 | | | | | | | |--- weights: [38.02, 0.00] class: 0 | | | | | | |--- arrival_date > 7.50 | | | | | | | |--- avg_price_per_room <= 93.58 | | | | | | | | |--- avg_price_per_room <= 65.38 | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | | |--- avg_price_per_room > 65.38 | | 
| | | | | | | |--- weights: [24.60, 3.04] class: 0 | | | | | | | |--- avg_price_per_room > 93.58 | | | | | | | | |--- arrival_date <= 28.00 | | | | | | | | | |--- weights: [14.91, 72.87] class: 1 | | | | | | | | |--- arrival_date > 28.00 | | | | | | | | | |--- weights: [9.69, 1.52] class: 0 | | | | | |--- no_of_week_nights > 1.50 | | | | | | |--- no_of_adults <= 1.50 | | | | | | | |--- weights: [84.25, 0.00] class: 0 | | | | | | |--- no_of_adults > 1.50 | | | | | | | |--- lead_time <= 125.50 | | | | | | | | |--- avg_price_per_room <= 90.85 | | | | | | | | | |--- avg_price_per_room <= 87.50 | | | | | | | | | | |--- weights: [13.42, 13.66] class: 1 | | | | | | | | | |--- avg_price_per_room > 87.50 | | | | | | | | | | |--- weights: [0.00, 15.18] class: 1 | | | | | | | | |--- avg_price_per_room > 90.85 | | | | | | | | | |--- weights: [10.44, 0.00] class: 0 | | | | | | | |--- lead_time > 125.50 | | | | | | | | |--- arrival_date <= 19.50 | | | | | | | | | |--- weights: [58.15, 18.22] class: 0 | | | | | | | | |--- arrival_date > 19.50 | | | | | | | | | |--- weights: [61.88, 1.52] class: 0 | | |--- market_segment_type_Online > 0.50 | | | |--- lead_time <= 13.50 | | | | |--- avg_price_per_room <= 99.44 | | | | | |--- arrival_month <= 1.50 | | | | | | |--- weights: [92.45, 0.00] class: 0 | | | | | |--- arrival_month > 1.50 | | | | | | |--- arrival_month <= 8.50 | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | |--- avg_price_per_room <= 70.05 | | | | | | | | | |--- weights: [31.31, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 70.05 | | | | | | | | | |--- lead_time <= 5.50 | | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | | |--- weights: [38.77, 1.52] class: 0 | | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- lead_time > 5.50 | | | | | | | | | | |--- arrival_date <= 3.50 | | | | | | | | | | | |--- weights: [6.71, 0.00] class: 0 | | | | | | | | | | |--- 
arrival_date > 3.50 | | | | | | | | | | | |--- weights: [34.30, 40.99] class: 1 | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | |--- weights: [0.00, 19.74] class: 1 | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | |--- lead_time <= 2.50 | | | | | | | | | | |--- avg_price_per_room <= 74.21 | | | | | | | | | | | |--- weights: [0.75, 3.04] class: 1 | | | | | | | | | | |--- avg_price_per_room > 74.21 | | | | | | | | | | | |--- weights: [9.69, 0.00] class: 0 | | | | | | | | | |--- lead_time > 2.50 | | | | | | | | | | |--- weights: [4.47, 10.63] class: 1 | | | | | | |--- arrival_month > 8.50 | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | |--- weights: [155.07, 6.07] class: 0 | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- weights: [3.73, 10.63] class: 1 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- weights: [7.46, 0.00] class: 0 | | | | |--- avg_price_per_room > 99.44 | | | | | |--- lead_time <= 3.50 | | | | | | |--- avg_price_per_room <= 202.67 | | | | | | | |--- no_of_week_nights <= 4.50 | | | | | | | | |--- arrival_month <= 5.50 | | | | | | | | | |--- weights: [63.37, 30.36] class: 0 | | | | | | | | |--- arrival_month > 5.50 | | | | | | | | | |--- arrival_date <= 20.50 | | | | | | | | | | |--- weights: [115.56, 12.14] class: 0 | | | | | | | | | |--- arrival_date > 20.50 | | | | | | | | | | |--- arrival_date <= 24.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- arrival_date > 24.50 | | | | | | | | | | | |--- weights: [28.33, 3.04] class: 0 | | | | | | | |--- no_of_week_nights > 4.50 | | | | | | | | |--- weights: [0.00, 6.07] class: 1 | | | | | | |--- avg_price_per_room > 202.67 | | | | | | | |--- weights: [0.75, 22.77] class: 1 | | | | | |--- lead_time > 3.50 | | | | | | |--- arrival_month <= 8.50 | | | | | | | |--- avg_price_per_room <= 119.25 | | | | | | | | |--- 
avg_price_per_room <= 118.50 | | | | | | | | | |--- weights: [18.64, 59.21] class: 1 | | | | | | | | |--- avg_price_per_room > 118.50 | | | | | | | | | |--- weights: [8.20, 1.52] class: 0 | | | | | | | |--- avg_price_per_room > 119.25 | | | | | | | | |--- weights: [34.30, 171.55] class: 1 | | | | | | |--- arrival_month > 8.50 | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | |--- weights: [26.09, 1.52] class: 0 | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- arrival_date <= 14.00 | | | | | | | | | | |--- weights: [9.69, 36.43] class: 1 | | | | | | | | | |--- arrival_date > 14.00 | | | | | | | | | | |--- avg_price_per_room <= 208.67 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- avg_price_per_room > 208.67 | | | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- weights: [15.66, 0.00] class: 0 | | | |--- lead_time > 13.50 | | | | |--- required_car_parking_space <= 0.50 | | | | | |--- avg_price_per_room <= 71.92 | | | | | | |--- avg_price_per_room <= 59.43 | | | | | | | |--- lead_time <= 84.50 | | | | | | | | |--- weights: [50.70, 7.59] class: 0 | | | | | | | |--- lead_time > 84.50 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- arrival_date <= 27.00 | | | | | | | | | | |--- lead_time <= 131.50 | | | | | | | | | | | |--- weights: [0.75, 15.18] class: 1 | | | | | | | | | | |--- lead_time > 131.50 | | | | | | | | | | | |--- weights: [2.24, 0.00] class: 0 | | | | | | | | | |--- arrival_date > 27.00 | | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- weights: [10.44, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 59.43 | | | | | | | |--- lead_time <= 25.50 | | | | | | | | |--- weights: [20.88, 6.07] class: 0 | | | | | | | |--- lead_time > 25.50 | | | | | | | | |--- avg_price_per_room <= 71.34 | | | | | | | | 
| |--- arrival_month <= 3.50 | | | | | | | | | | |--- lead_time <= 68.50 | | | | | | | | | | | |--- weights: [15.66, 78.94] class: 1 | | | | | | | | | | |--- lead_time > 68.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- arrival_month > 3.50 | | | | | | | | | | |--- lead_time <= 102.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- lead_time > 102.00 | | | | | | | | | | | |--- weights: [12.67, 3.04] class: 0 | | | | | | | | |--- avg_price_per_room > 71.34 | | | | | | | | | |--- weights: [11.18, 0.00] class: 0 | | | | | |--- avg_price_per_room > 71.92 | | | | | | |--- arrival_year <= 2017.50 | | | | | | | |--- lead_time <= 65.50 | | | | | | | | |--- avg_price_per_room <= 120.45 | | | | | | | | | |--- weights: [79.77, 9.11] class: 0 | | | | | | | | |--- avg_price_per_room > 120.45 | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | |--- weights: [3.73, 12.14] class: 1 | | | | | | | |--- lead_time > 65.50 | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50 | | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | | |--- weights: [16.40, 47.06] class: 1 | | | | | | | | | |--- arrival_date > 27.50 | | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 > 0.50 | | | | | | | | | |--- weights: [0.00, 63.76] class: 1 | | | | | | |--- arrival_year > 2017.50 | | | | | | | |--- avg_price_per_room <= 104.31 | | | | | | | | |--- lead_time <= 25.50 | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | | | | |--- weights: [16.40, 0.00] class: 0 | | | | | | | | | | |--- arrival_month > 1.50 | | | | | | | | | | | |--- weights: [38.77, 118.41] class: 1 | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | |--- weights: [23.11, 0.00] class: 0 | | | | | | | | |--- lead_time > 
25.50 | | | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | | |--- weights: [39.51, 185.21] class: 1 | | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | | | |--- weights: [73.81, 411.41] class: 1 | | | | | | | |--- avg_price_per_room > 104.31 | | | | | | | | |--- arrival_month <= 10.50 | | | | | | | | | |--- room_type_reserved_Room_Type 5 <= 0.50 | | | | | | | | | | |--- avg_price_per_room <= 195.30 | | | | | | | | | | | |--- truncated branch of depth 9 | | | | | | | | | | |--- avg_price_per_room > 195.30 | | | | | | | | | | | |--- weights: [0.75, 138.15] class: 1 | | | | | | | | | |--- room_type_reserved_Room_Type 5 > 0.50 | | | | | | | | | | |--- arrival_date <= 22.50 | | | | | | | | | | | |--- weights: [11.18, 6.07] class: 0 | | | | | | | | | | |--- arrival_date > 22.50 | | | | | | | | | | | |--- weights: [0.75, 9.11] class: 1 | | | | | | | | |--- arrival_month > 10.50 | | | | | | | | | |--- avg_price_per_room <= 168.06 | | | | | | | | | | |--- lead_time <= 22.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 22.00 | | | | | | | | | | | |--- weights: [17.15, 83.50] class: 1 | | | | | | | | | |--- avg_price_per_room > 168.06 | | | | | | | | | | |--- weights: [12.67, 6.07] class: 0 | | | | |--- required_car_parking_space > 0.50 | | | | | |--- weights: [48.46, 1.52] class: 0 | |--- no_of_special_requests > 0.50 | | |--- no_of_special_requests <= 1.50 | | | |--- market_segment_type_Online <= 0.50 | | | | |--- lead_time <= 102.50 | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | |--- weights: [697.09, 9.11] class: 0 | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | |--- lead_time <= 63.00 | | | | | | | |--- weights: [15.66, 1.52] class: 0 | | | | | | |--- lead_time > 63.00 | | | | | | | 
|--- weights: [0.00, 7.59] class: 1 | | | | |--- lead_time > 102.50 | | | | | |--- no_of_week_nights <= 2.50 | | | | | | |--- lead_time <= 105.00 | | | | | | | |--- weights: [0.75, 6.07] class: 1 | | | | | | |--- lead_time > 105.00 | | | | | | | |--- weights: [31.31, 13.66] class: 0 | | | | | |--- no_of_week_nights > 2.50 | | | | | | |--- weights: [44.73, 3.04] class: 0 | | | |--- market_segment_type_Online > 0.50 | | | | |--- lead_time <= 8.50 | | | | | |--- lead_time <= 4.50 | | | | | | |--- no_of_week_nights <= 10.00 | | | | | | | |--- weights: [498.03, 40.99] class: 0 | | | | | | |--- no_of_week_nights > 10.00 | | | | | | | |--- weights: [0.00, 3.04] class: 1 | | | | | |--- lead_time > 4.50 | | | | | | |--- arrival_date <= 13.50 | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | |--- weights: [58.90, 36.43] class: 0 | | | | | | | |--- arrival_month > 9.50 | | | | | | | | |--- weights: [33.55, 1.52] class: 0 | | | | | | |--- arrival_date > 13.50 | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | |--- weights: [123.76, 9.11] class: 0 | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | |--- avg_price_per_room <= 126.33 | | | | | | | | | |--- weights: [32.80, 3.04] class: 0 | | | | | | | | |--- avg_price_per_room > 126.33 | | | | | | | | | |--- weights: [9.69, 13.66] class: 1 | | | | |--- lead_time > 8.50 | | | | | |--- required_car_parking_space <= 0.50 | | | | | | |--- avg_price_per_room <= 118.55 | | | | | | | |--- lead_time <= 61.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | | | |--- weights: [70.08, 0.00] class: 0 | | | | | | | | | |--- arrival_month > 1.50 | | | | | | | | | | |--- no_of_week_nights <= 4.50 | | | | | | | | | | | |--- truncated branch of depth 11 | | | | | | | | | | |--- no_of_week_nights > 4.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- weights: [126.74, 
1.52] class: 0 | | | | | | | |--- lead_time > 61.50 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | | |--- weights: [4.47, 57.69] class: 1 | | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | | |--- lead_time <= 66.50 | | | | | | | | | | | |--- weights: [5.22, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 66.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | |--- avg_price_per_room <= 71.93 | | | | | | | | | | | |--- weights: [54.43, 3.04] class: 0 | | | | | | | | | | |--- avg_price_per_room > 71.93 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | |--- avg_price_per_room > 118.55 | | | | | | | |--- arrival_month <= 8.50 | | | | | | | | |--- arrival_date <= 19.50 | | | | | | | | | |--- no_of_week_nights <= 7.50 | | | | | | | | | | |--- avg_price_per_room <= 177.15 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- avg_price_per_room > 177.15 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- no_of_week_nights > 7.50 | | | | | | | | | | |--- weights: [0.00, 6.07] class: 1 | | | | | | | | |--- arrival_date > 19.50 | | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | | |--- avg_price_per_room <= 121.20 | | | | | | | | | | | |--- weights: [18.64, 6.07] class: 0 | | | | | | | | | | |--- avg_price_per_room > 121.20 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- arrival_date > 27.50 | | | | | | | | | | |--- lead_time <= 55.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 55.50 
| | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | |--- arrival_month > 8.50 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | |--- weights: [11.93, 10.63] class: 0 | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | |--- weights: [37.28, 0.00] class: 0 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | |--- avg_price_per_room <= 119.20 | | | | | | | | | | | |--- weights: [9.69, 28.84] class: 1 | | | | | | | | | | |--- avg_price_per_room > 119.20 | | | | | | | | | | | |--- truncated branch of depth 12 | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | |--- lead_time <= 100.00 | | | | | | | | | | | |--- weights: [49.95, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 100.00 | | | | | | | | | | | |--- weights: [0.75, 18.22] class: 1 | | | | | |--- required_car_parking_space > 0.50 | | | | | | |--- weights: [134.20, 1.52] class: 0 | | |--- no_of_special_requests > 1.50 | | | |--- lead_time <= 90.50 | | | | |--- no_of_week_nights <= 3.50 | | | | | |--- weights: [1585.04, 0.00] class: 0 | | | | |--- no_of_week_nights > 3.50 | | | | | |--- no_of_special_requests <= 2.50 | | | | | | |--- no_of_week_nights <= 9.50 | | | | | | | |--- lead_time <= 6.50 | | | | | | | | |--- weights: [32.06, 0.00] class: 0 | | | | | | | |--- lead_time > 6.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- arrival_date <= 5.50 | | | | | | | | | | |--- weights: [23.11, 1.52] class: 0 | | | | | | | | | |--- arrival_date > 5.50 | | | | | | | | | | |--- avg_price_per_room <= 93.09 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- avg_price_per_room > 93.09 | | | | | | | | | | | |--- weights: [77.54, 27.33] class: 0 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- weights: [19.38, 0.00] class: 0 | | | | | | |--- no_of_week_nights > 9.50 | | | | | | | |--- weights: [0.00, 
3.04] class: 1 | | | | | |--- no_of_special_requests > 2.50 | | | | | | |--- weights: [52.19, 0.00] class: 0 | | | |--- lead_time > 90.50 | | | | |--- no_of_special_requests <= 2.50 | | | | | |--- arrival_month <= 8.50 | | | | | | |--- avg_price_per_room <= 202.95 | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | |--- weights: [1.49, 9.11] class: 1 | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | |--- weights: [8.20, 3.04] class: 0 | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | |--- lead_time <= 150.50 | | | | | | | | | |--- weights: [175.20, 28.84] class: 0 | | | | | | | | |--- lead_time > 150.50 | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | |--- avg_price_per_room > 202.95 | | | | | | | |--- weights: [0.00, 10.63] class: 1 | | | | | |--- arrival_month > 8.50 | | | | | | |--- avg_price_per_room <= 153.15 | | | | | | | |--- room_type_reserved_Room_Type 2 <= 0.50 | | | | | | | | |--- avg_price_per_room <= 71.12 | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 71.12 | | | | | | | | | |--- avg_price_per_room <= 90.42 | | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | | |--- weights: [12.67, 7.59] class: 0 | | | | | | | | | |--- avg_price_per_room > 90.42 | | | | | | | | | | |--- weights: [64.12, 60.72] class: 0 | | | | | | | |--- room_type_reserved_Room_Type 2 > 0.50 | | | | | | | | |--- weights: [5.96, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 153.15 | | | | | | | |--- weights: [12.67, 3.04] class: 0 | | | | |--- no_of_special_requests > 2.50 | | | | | |--- weights: [67.10, 0.00] class: 0 |--- lead_time > 151.50 | |--- avg_price_per_room <= 100.04 | | |--- no_of_special_requests <= 0.50 | | | |--- no_of_adults <= 1.50 | | | | |--- market_segment_type_Online <= 0.50 | | | | | |--- lead_time <= 163.50 
| | | | | | |--- arrival_month <= 5.00 | | | | | | | |--- weights: [2.98, 0.00] class: 0 | | | | | | |--- arrival_month > 5.00 | | | | | | | |--- weights: [0.75, 24.29] class: 1 | | | | | |--- lead_time > 163.50 | | | | | | |--- lead_time <= 341.00 | | | | | | | |--- lead_time <= 173.00 | | | | | | | | |--- arrival_date <= 3.50 | | | | | | | | | |--- weights: [46.97, 9.11] class: 0 | | | | | | | | |--- arrival_date > 3.50 | | | | | | | | | |--- no_of_weekend_nights <= 1.00 | | | | | | | | | | |--- weights: [0.00, 13.66] class: 1 | | | | | | | | | |--- no_of_weekend_nights > 1.00 | | | | | | | | | | |--- weights: [2.24, 0.00] class: 0 | | | | | | | |--- lead_time > 173.00 | | | | | | | | |--- arrival_month <= 5.50 | | | | | | | | | |--- arrival_date <= 7.50 | | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | | | |--- arrival_date > 7.50 | | | | | | | | | | |--- weights: [6.71, 0.00] class: 0 | | | | | | | | |--- arrival_month > 5.50 | | | | | | | | | |--- weights: [188.62, 7.59] class: 0 | | | | | | |--- lead_time > 341.00 | | | | | | | |--- weights: [13.42, 27.33] class: 1 | | | | |--- market_segment_type_Online > 0.50 | | | | | |--- avg_price_per_room <= 2.50 | | | | | | |--- lead_time <= 285.50 | | | | | | | |--- weights: [8.20, 0.00] class: 0 | | | | | | |--- lead_time > 285.50 | | | | | | | |--- weights: [0.75, 3.04] class: 1 | | | | | |--- avg_price_per_room > 2.50 | | | | | | |--- weights: [0.75, 97.16] class: 1 | | | |--- no_of_adults > 1.50 | | | | |--- avg_price_per_room <= 82.47 | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | |--- weights: [2.98, 282.37] class: 1 | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | |--- arrival_month <= 11.50 | | | | | | | |--- lead_time <= 244.00 | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | |--- lead_time <= 166.50 | | | | | | | | | | | |--- weights: [2.24, 0.00] class: 0 | | | | | | | | | | |--- lead_time 
> 166.50 | | | | | | | | | | | |--- weights: [2.24, 57.69] class: 1 | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | |--- weights: [17.89, 0.00] class: 0 | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | | |--- weights: [11.18, 3.04] class: 0 | | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | | |--- weights: [0.00, 12.14] class: 1 | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | |--- weights: [75.30, 12.14] class: 0 | | | | | | | |--- lead_time > 244.00 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- weights: [25.35, 0.00] class: 0 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- avg_price_per_room <= 80.38 | | | | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | | | | |--- weights: [11.18, 264.15] class: 1 | | | | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- avg_price_per_room > 80.38 | | | | | | | | | | |--- weights: [7.46, 0.00] class: 0 | | | | | | |--- arrival_month > 11.50 | | | | | | | |--- weights: [46.22, 0.00] class: 0 | | | | |--- avg_price_per_room > 82.47 | | | | | |--- no_of_adults <= 2.50 | | | | | | |--- lead_time <= 324.50 | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50 | | | | | | | | | |--- weights: [7.46, 986.78] class: 1 | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50 | | | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | | | |--- weights: [0.00, 10.63] class: 1 | | | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | | | |--- weights: [4.47, 0.00] class: 0 | | | | | | | |--- arrival_month > 11.50 | | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | | |--- weights: [0.00, 19.74] class: 1 | | | | | | | | |--- 
market_segment_type_Offline > 0.50 | | | | | | | | | |--- weights: [5.22, 0.00] class: 0 | | | | | | |--- lead_time > 324.50 | | | | | | | |--- avg_price_per_room <= 89.00 | | | | | | | | |--- weights: [5.96, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 89.00 | | | | | | | | |--- weights: [0.75, 13.66] class: 1 | | | | | |--- no_of_adults > 2.50 | | | | | | |--- weights: [5.22, 0.00] class: 0 | | |--- no_of_special_requests > 0.50 | | | |--- no_of_weekend_nights <= 0.50 | | | | |--- lead_time <= 180.50 | | | | | |--- lead_time <= 159.50 | | | | | | |--- arrival_month <= 8.50 | | | | | | | |--- weights: [5.96, 0.00] class: 0 | | | | | | |--- arrival_month > 8.50 | | | | | | | |--- weights: [1.49, 7.59] class: 1 | | | | | |--- lead_time > 159.50 | | | | | | |--- arrival_date <= 1.50 | | | | | | | |--- weights: [1.49, 3.04] class: 1 | | | | | | |--- arrival_date > 1.50 | | | | | | | |--- weights: [35.79, 1.52] class: 0 | | | | |--- lead_time > 180.50 | | | | | |--- no_of_special_requests <= 2.50 | | | | | | |--- market_segment_type_Online <= 0.50 | | | | | | | |--- no_of_adults <= 2.50 | | | | | | | | |--- weights: [12.67, 3.04] class: 0 | | | | | | | |--- no_of_adults > 2.50 | | | | | | | | |--- weights: [0.00, 3.04] class: 1 | | | | | | |--- market_segment_type_Online > 0.50 | | | | | | | |--- weights: [7.46, 206.46] class: 1 | | | | | |--- no_of_special_requests > 2.50 | | | | | | |--- weights: [8.95, 0.00] class: 0 | | | |--- no_of_weekend_nights > 0.50 | | | | |--- market_segment_type_Offline <= 0.50 | | | | | |--- arrival_month <= 11.50 | | | | | | |--- avg_price_per_room <= 76.48 | | | | | | | |--- weights: [46.97, 4.55] class: 0 | | | | | | |--- avg_price_per_room > 76.48 | | | | | | | |--- no_of_week_nights <= 6.50 | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | |--- lead_time <= 233.00 | | | | | | | | | | |--- lead_time <= 152.50 | | | | | | | | | | | |--- weights: [1.49, 4.55] class: 1 | | | | | | | | | | |--- lead_time > 152.50 | | | 
| | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- lead_time > 233.00 | | | | | | | | | | |--- weights: [23.11, 19.74] class: 0 | | | | | | | | |--- arrival_date > 27.50 | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | |--- weights: [2.24, 15.18] class: 1 | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | |--- lead_time <= 269.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- lead_time > 269.00 | | | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | |--- no_of_week_nights > 6.50 | | | | | | | | |--- weights: [4.47, 13.66] class: 1 | | | | | |--- arrival_month > 11.50 | | | | | | |--- arrival_date <= 14.50 | | | | | | | |--- weights: [8.20, 3.04] class: 0 | | | | | | |--- arrival_date > 14.50 | | | | | | | |--- weights: [11.18, 31.88] class: 1 | | | | |--- market_segment_type_Offline > 0.50 | | | | | |--- lead_time <= 348.50 | | | | | | |--- weights: [106.61, 3.04] class: 0 | | | | | |--- lead_time > 348.50 | | | | | | |--- weights: [5.96, 4.55] class: 0 | |--- avg_price_per_room > 100.04 | | |--- arrival_month <= 11.50 | | | |--- no_of_special_requests <= 2.50 | | | | |--- weights: [0.00, 3200.19] class: 1 | | | |--- no_of_special_requests > 2.50 | | | | |--- weights: [23.11, 0.00] class: 0 | | |--- arrival_month > 11.50 | | | |--- no_of_special_requests <= 0.50 | | | | |--- weights: [35.04, 0.00] class: 0 | | | |--- no_of_special_requests > 0.50 | | | | |--- arrival_date <= 24.50 | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | |--- arrival_date > 24.50 | | | | | |--- weights: [3.73, 22.77] class: 1
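# Impurity-based feature importances of the selected (post-pruned) tree
importances = best_model.feature_importances_
# Sort ascending so the most important feature is drawn at the top of the bar chart
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
# Label each bar with the matching feature name
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()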
Observations
Lead time is still the most important predictor of booking cancellation.
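The bar chart can be complemented by a ranked table of importances, which is easier to quote in a report. The sketch below is only an illustration: it fits a small tree on synthetic data (`X_demo`, `y_demo` and `demo_model` are hypothetical names, and the rule that generates the target is invented), so the numbers it prints say nothing about the INN Hotels data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the booking features.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(
    {
        "lead_time": rng.integers(0, 400, size=500),
        "avg_price_per_room": rng.uniform(50, 200, size=500),
        "no_of_special_requests": rng.integers(0, 4, size=500),
    }
)
# Invented rule for illustration only: bookings with long lead times cancel.
y_demo = (X_demo["lead_time"] > 150).astype(int)

demo_model = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)

# Rank features by impurity-based importance, largest first.
importance_table = (
    pd.Series(demo_model.feature_importances_, index=X_demo.columns, name="importance")
    .sort_values(ascending=False)
    .to_frame()
)
print(importance_table)
```

With the notebook's own objects, the same table could presumably be built from `best_model.feature_importances_` and `feature_names`.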
Model Performance Comparison and Conclusions¶
# Training performance comparison
models_train_comp_df = pd.concat(
[
decision_tree_perf_train_default.T,
decision_tree_tune_perf_train.T,
decision_tree_post_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.99421 | 0.83097 | 0.89954 |
| Recall | 0.98661 | 0.78608 | 0.90303 |
| Precision | 0.99578 | 0.72425 | 0.81274 |
| F1 | 0.99117 | 0.75390 | 0.85551 |
# Testing performance comparison
models_test_comp_df = pd.concat(
[
decision_tree_perf_test_default.T,
decision_tree_tune_perf_test.T,
decision_tree_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
| | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.87118 | 0.83497 | 0.89954 |
| Recall | 0.81175 | 0.78336 | 0.90303 |
| Precision | 0.79461 | 0.72758 | 0.81274 |
| F1 | 0.80309 | 0.75444 | 0.85551 |
Observations
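- The default decision tree overfit the training data: its training F1 score (0.991, almost perfect) was much higher than its test F1 score (0.803).
- Pre-pruning removed the overfitting, with nearly identical F1 scores on the training and test sets (about 0.754 in both cases). Cost-complexity pruning (post-pruning) gave the best reported F1 score (0.856). The post-pruned test metrics shown are identical to the training metrics, so the decision_tree_test values used in the test comparison should be verified.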
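The size of the overfitting can be made explicit by subtracting test F1 from training F1 for each model. The snippet below is a minimal sketch that re-enters the F1 values from the two comparison tables by hand; the `f1` frame and its `gap` column are names introduced here for illustration.

```python
import pandas as pd

# F1 scores copied from the training and testing comparison tables.
f1 = pd.DataFrame(
    {
        "train": [0.99117, 0.75390, 0.85551],
        "test": [0.80309, 0.75444, 0.85551],
    },
    index=[
        "Decision Tree sklearn",
        "Decision Tree (Pre-Pruning)",
        "Decision Tree (Post-Pruning)",
    ],
)

# A large positive gap means the model does much better on data it has seen.
f1["gap"] = f1["train"] - f1["test"]
print(f1.round(5))
```

A gap close to zero, as for the pre-pruned tree, suggests the model generalizes about as well as it fits; a gap of exactly zero is unusual and is worth double-checking.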
Actionable Insights and Recommendations¶
- What profitable policies for cancellations and refunds can the hotel adopt?
- What other recommendations would you suggest to the hotel?
Actionable Insights¶
- No missing or duplicated values were identified in the dataset. This attests to the integrity of the data and its readiness for exploratory analysis. Data integrity should continue to be monitored so that future analyses remain reliable.
- The outliers in the dataset show that some bookings have very long lead times, exceeding 300 days. Because lead time is the strongest predictor of cancellation, bookings with shorter lead times are preferable.
- The default decision tree overfit the training data, so pruning was necessary before the model could be relied on.
- The month of October appeared to be the hotel's busiest month.
- The online segment has the highest cancellation rate, followed by the offline segment. INN Hotels Group gets most of its business from these two segments.
- Most future cancellations are therefore likely to come from the online and offline market segments. Steps should be taken to reduce cancellations in these segments in order to improve revenue, profitability, and resource allocation.
- Booking status is related to lead time and market segment type. Lead time is the most important predictor of booking cancellation.
- INN Hotels Group could build its marketing and resource allocation strategy around the different seasons of the year to improve cost management and profitability.
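The segment-level cancellation rates behind these insights can be computed with a simple group-by. The sketch below uses a handful of made-up rows; the column names follow the data dictionary, but the label values `"Canceled"` and `"Not_Canceled"` and the helper column `is_canceled` are assumptions made for illustration.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the INN Hotels dataset.
bookings = pd.DataFrame(
    {
        "market_segment_type": [
            "Online", "Online", "Online", "Offline", "Offline", "Corporate",
        ],
        "booking_status": [
            "Canceled", "Canceled", "Not_Canceled",
            "Canceled", "Not_Canceled", "Not_Canceled",
        ],
    }
)

# Encode the target as 1 (canceled) / 0 (not canceled).
bookings["is_canceled"] = (bookings["booking_status"] == "Canceled").astype(int)

# Number of bookings and share of cancellations per market segment.
segment_summary = (
    bookings.groupby("market_segment_type")["is_canceled"]
    .agg(bookings="count", cancellation_rate="mean")
    .sort_values("cancellation_rate", ascending=False)
)
print(segment_summary)
```

Reporting the booking count next to the rate helps avoid over-reading a high rate in a segment with very few bookings.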
Recommendations¶
- Undertake a customer profitability analysis for each market segment type to determine which segments are profitable and which are not. This will enable the hotel to analyse the resources used in serving specific customers and compare these resources with the revenue generated from those customers.
- INN Hotels should nurture its relationships with its key customers and understand the cost of serving them, so that it can meet their expectations in a cost-effective manner.
- INN Hotels should decide on a fair deadline for free cancellation. An appropriate fee should be charged after the free deadline.
- A No-Show policy should be developed and communicated clearly to the customer at the time of booking. A full rate could be charged on the first night or a different penalty could be considered.
- Demand full payment for the reservation
- The INN Hotels cancellation policy should include a force majeure clause. Force majeure refers to events such as acts of God, political unrest, or a global pandemic like COVID. Such a clause would allow INN Hotels to offer flexible cancellation, with empathy, to any guest affected by an event beyond their control.
- Build a Machine Learning-based solution to help in predicting booking cancellations.
- INN Hotels should have a refund policy which should specify the conditions and the method of refund. Is it a full refund, partial, or credit towards a future stay? These terms should be transparent and fair to all guests.
- In low-demand periods, strategic discounts or promotions may be deployed to attract more guests.
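One way the recommended Machine Learning-based solution could support these policies is to score incoming bookings and flag those with a high predicted cancellation probability, for example for a deposit request or a reminder. The following is a rough sketch on synthetic data: the training rule, the `RISK_THRESHOLD` value, and the names `train`, `canceled`, `model` and `new_bookings` are all assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data; the cancellation rule below is invented for illustration.
rng = np.random.default_rng(1)
train = pd.DataFrame(
    {
        "lead_time": rng.integers(0, 400, size=1000),
        "no_of_special_requests": rng.integers(0, 4, size=1000),
    }
)
canceled = (
    (train["lead_time"] > 150) & (train["no_of_special_requests"] == 0)
).astype(int)

model = DecisionTreeClassifier(max_depth=3, random_state=1)
model.fit(train, canceled)

# Hypothetical incoming bookings to be scored.
new_bookings = pd.DataFrame(
    {"lead_time": [10, 300, 300], "no_of_special_requests": [0, 0, 2]}
)

# Assumed cut-off; in practice it would be tuned to the cost of each error type.
RISK_THRESHOLD = 0.5

# Column 1 of predict_proba is the probability of the positive class (canceled).
new_bookings["cancel_probability"] = model.predict_proba(new_bookings)[:, 1]
new_bookings["high_risk"] = new_bookings["cancel_probability"] >= RISK_THRESHOLD
print(new_bookings)
```

Lowering the threshold would flag more bookings and catch more cancellations at the cost of more false alarms, which is the same recall versus precision trade-off seen in the model comparison.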